Maria Teresa Pazienza1, Marco Pennacchiotti2, Armando Stellato1

Presentation transcript:

A Web Browser Extension for growing-up Ontological Knowledge from Traditional Web Content
Maria Teresa Pazienza (1), Marco Pennacchiotti (2), Armando Stellato (1)
(1) University of Rome, Tor Vergata, {pazienza, stellato}@info.uniroma2.it
(2) Saarland University, pennacchiotti@coli.uni-sb.de

Outline
- Objectives
- Semantic Turkey: a Semantic Bookmarking tool
- Semantic Turkey Architecture
- Semantic Turkey Main Functionalities
- Extending Semantic Turkey: Ontology Learning
- Learning Ontological Content from Tables
- Learning Semantic Relations from Text
- Future Work
12/01/2019 Armando Stellato stellato@info.uniroma2.it ai-nlp.info.uniroma2.it/stellato

Objectives
Turn the usual tool for Web navigation, the Web browser, into a means for:
- collecting information from web pages, be it:
  - domain terminology
  - factual information (objects)
- organizing collected content to:
  - create a new ontology and/or extend existing ones with new axioms
  - populate ontologies with new instance data
Main contribution
- Unify the worlds of:
  - traditional ontology editing (Protege, TopBraid Composer, etc.)
  - semantic annotation (Melita, Gate, Magpie, Annotea)
- to give life to a single environment for knowledge acquisition and management
Requirements
- Extensible architecture
- Easy-to-perform knowledge acquisition process
- Robustness with respect to different web technologies

Semantic Turkey: A Semantic Bookmarking tool

Semantic Turkey
Objective: improving the Web navigation experience
- Focused on the "I've already seen X somewhere else on the Web, but... where?" problem:
  - Did I keep track of X?
  - If yes, where did I put the link to a web document about X?
  - In which folder of my bookmarks should I check for these links, and will I recognize them by name at a quick glance?
Our approach
- Obtain a clear separation between pure knowledge data (the WHAT) and web links (the WHERE)
- Offer innovative navigation of both the acquired information and the pages where it has been collected

Semantic Bookmarking: Requirements and Design Goals
- Capturing information from web pages, both by considering the pages as a whole and by annotating portions of their text
- Editing a personal ontology for categorizing the annotated information and, possibly, exchanging data with other users
- Navigating the structured information as an underlying semantic net, with links to the web sources where it has been annotated
- Clear separation between business model and user interface

Semantic Turkey Architecture
Three-layered architecture
- Presentation Layer
  - An extension to the Firefox browser. The user interface has been created through a combined use of XUL, XBL and JavaScript.
- Services Layer
  - Enables communication between the client (the Firefox browser extension) and the ontology persistence layer. Deployed as services that are invoked through HTTP requests submitted according to the Ajax paradigm.
- Persistence Layer
  - Access to ontological knowledge. Based on a dedicated ontology API, which can be implemented through different technologies.
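The separation between the services layer and the persistence layer can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `OntologyStore` interface, the `InMemoryStore` backend, the `handle_request` dispatcher and the service names `addInstance` and `listInstances` are hypothetical and are not taken from Semantic Turkey itself.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class OntologyStore(ABC):
    """Hypothetical persistence-layer API (not the actual Semantic Turkey one)."""

    @abstractmethod
    def add_instance(self, name: str, class_name: str) -> None: ...

    @abstractmethod
    def instances_of(self, class_name: str) -> list[str]: ...


class InMemoryStore(OntologyStore):
    """One possible backend; another technology could implement the same API."""

    def __init__(self) -> None:
        self._instances: dict[str, list[str]] = {}

    def add_instance(self, name: str, class_name: str) -> None:
        self._instances.setdefault(class_name, []).append(name)

    def instances_of(self, class_name: str) -> list[str]:
        return list(self._instances.get(class_name, []))


def handle_request(store: OntologyStore, service: str, params: dict[str, str]) -> dict:
    """Services layer: map a request, as a browser client might send it, to the store."""
    if service == "addInstance":          # hypothetical service name
        store.add_instance(params["name"], params["class"])
        return {"status": "ok"}
    if service == "listInstances":        # hypothetical service name
        return {"status": "ok", "instances": store.instances_of(params["class"])}
    return {"status": "error", "reason": f"unknown service {service!r}"}
```

Because the dispatcher only depends on the abstract interface, the backend could be swapped without touching the client-facing services, which is the property the three-layer design aims for.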

Knowledge Model
- Application Layer
  - Contains the ontologies needed by the application for coordinating and organizing its services
  - These ontologies are hidden from the user by default (their schema and related content can be shown for administrative purposes)
  - In the core version of ST, it includes the Semantic Annotation ontology, which provides concepts and relations for keeping track of the user's semantic bookmarks, such as SemanticAnnotation, Document and WebPage, plus the properties required for relating their instances
- User/Domain Layer
  - ST now works as an (almost) complete ontology editing tool, with functionalities for importing ontologies from the web, creating local caches, and editing new ontologies by adding concepts and instances, instantiating attributive (datatype) or relational (object) properties, etc.
  - New objects can be added independently of semantic annotations.

Semantic Turkey in Action: Semantic Annotation
No automatic ontology building from text, but with just one intuitive drag-and-drop operation (and a few human-computer interactions), the system:
- Creates a new Domain Object instance (and/or builds a new lexicalization for an already existing instance on the annotated page)
- Creates a new SemanticAnnotation instance
- Creates a new WebPage instance
- Relates all of them through dedicated properties
... (depending on the specific operation)
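What a single annotation operation records can be sketched as a handful of triples. This is only an illustration under stated assumptions: the class names SemanticAnnotation and WebPage come from the slides, while the `annotate` function, the identifier scheme and the property names (`text`, `annotates`, `location`, `url`, `title`) are hypothetical. The sketch always creates a new WebPage instance, although the slide notes that the exact behaviour depends on the operation.

```python
from itertools import count

_ids = count(1)  # hypothetical identifier scheme for new individuals


def annotate(triples, text, class_name, page_url, page_title):
    """Record one drag-and-drop annotation as a set of (subject, property, object) triples."""
    n = next(_ids)
    instance = text.replace(" ", "_")      # the new Domain Object instance
    annotation = f"annotation_{n}"         # the new SemanticAnnotation instance
    page = f"webpage_{n}"                  # the new WebPage instance
    triples.update({
        (instance, "rdf:type", class_name),
        (annotation, "rdf:type", "SemanticAnnotation"),
        (page, "rdf:type", "WebPage"),
        # Property names below are assumptions, not the tool's vocabulary.
        (annotation, "text", text),
        (annotation, "annotates", instance),
        (annotation, "location", page),
        (page, "url", page_url),
        (page, "title", page_title),
    })
    return instance, annotation, page
```

Keeping the annotation and the page as separate individuals is what lets the knowledge (the WHAT) be browsed independently of the links (the WHERE).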

Semantic Turkey in Action: Ontology Editing

Semantic Turkey in Action: Semantic Navigation

Automating the Turkey...
Can we speed up ontology building by (semi-)automatically learning ontological content from web pages?
- Ontology learning from text is a rich area of NLP [Buitelaar and Cimiano, 2008]
- We need to adapt classical methods to comply with the Turkey's requirements:
  - Low computational cost (no deep parsing or complex algorithms)
  - Ease of use
  - Focus on web content
- Two linguistic modules: (1) ontology learning from tables, (2) relation extraction from texts

Learning ontological content from tables

Web Tables
- A preferential way to convey knowledge on the Web
- Contain dense, meaningful knowledge
- Highly structured: the internal organization reveals ontological content
[Figure: a three-layered and a two-layered table, with column header, row header and internal cells labelled]
12/01/2019 Marco Pennacchiotti pennacchiotti@coli-uni-sb.de www.coli.uni-saarland.de/~pennacchiotti/

Table ontological model
- Class tables
  - Contain information on a class (property names, property values, instance names)
  - 3-layered
- Instance tables
  - Contain information on a single instance (property names and values)
  - 2-layered (2 columns)
  - (Instance: London)

Knowledge Extraction from tables
(Input: table; Output: ontological interpretation of the table)
1. Table identification (class vs. instance table)
     IF |columns| > 2: three-layered -> class table
     ELSE IF (a column header exists): three-layered -> class table
     ELSE: two-layered -> instance table
2. Table ontological analysis (identify ontological entities)
     IF (instance table): column-1 = property names, column-2 = property values
     IF (class table): decide how row/column headers map to property names / instance names according to the data type of the internal cells
     Apply:
       - Style-based heuristics
       - Value-based heuristics
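The identification rule and the instance-table reading above can be sketched in a few lines of Python. The `Table` type, the `has_column_header` flag (assumed to be detected elsewhere, for example from header cells in the markup) and the function names are hypothetical; the style-based and value-based heuristics for class tables are not covered because the slides do not describe them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Table:
    rows: list[list[str]]        # cell texts, row by row
    has_column_header: bool      # assumed to be known from the page markup


def identify(table: Table) -> str:
    """Apply the slide's rule: more than 2 columns or a column header means a class table."""
    n_columns = max((len(row) for row in table.rows), default=0)
    if n_columns > 2:
        return "class"
    if table.has_column_header:
        return "class"
    return "instance"


def read_instance_table(table: Table) -> dict[str, str]:
    """Instance table: column 1 holds property names, column 2 holds property values."""
    return {row[0]: row[1] for row in table.rows if len(row) == 2}
```

For a two-column table about London without a column header, `identify` returns `"instance"` and `read_instance_table` yields a property-to-value mapping that could then be proposed to the user for validation.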

Evaluation
- Corpus: 100 Wikipedia pages on cities, 207 tables
- Evaluation: accuracy on a Gold Standard created by an expert ontology engineer
- Good performance, especially on table identification
- (Indirectly) comparable to other tools: Tartar's accuracy on a similar task is 0.85 [Pivk et al., 2007]
Task                  Accuracy
Table identification  0.91
Ontological analysis  0.77

Module Interface
- Extract tables from web pages
- Suggest an interpretation for each table in the page
- Ask the user for validation
- Upload data into the ontology

Learning semantic relations from text

Relation Extraction
- Relational knowledge is central to ontologies: is_a(X,Y), located_in(X,Y), ...
- Relation extraction aims to (semi-)automatically extract relation instances from texts
- The most successful approaches are pattern-based [Hearst, 1992] (e.g. "X is in Y" for located_in(X,Y))
- We adopt a simple pattern-based approach, with instance weighting and pattern generalization for refining the returned instances
- Given one or more seed instances entered by the user, the system suggests new instances extracted from the Web and uploads them after the user's validation
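As a rough illustration of the pattern-based idea, the sketch below applies hand-written surface patterns with regular expressions. It is a simplification, not the module described in the slides: the `PATTERNS` table and the `extract` function are hypothetical, and entity mentions are approximated by capitalised word sequences rather than by any linguistic analysis.

```python
import re

# Assumption: a capitalised word sequence stands in for an entity mention.
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical hand-written patterns, one list per relation.
PATTERNS = {
    "located_in": [rf"({ENTITY}) is in ({ENTITY})"],
    "capital_of": [rf"({ENTITY}) is the capital of ({ENTITY})"],
}


def extract(text):
    """Return (relation, X, Y) triples for every pattern match in the text."""
    found = set()
    for relation, patterns in PATTERNS.items():
        for pattern in patterns:
            for x, y in re.findall(pattern, text):
                found.add((relation, x, y))
    return found
```

Matches produced this way are only candidates; in the workflow described above they would still be weighted and shown to the user for validation before being uploaded.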

Architecture
TARGET RELATION: CAPITAL_OF(X,Y)
Pattern induction algorithm similar to [Ravichandran & Hovy, 2002]
- Retrieve all sentences containing the seeds (X,Y)
- Analyze them with a dependency parser
- Induce patterns as paths between X and Y
Example: seed (Madrid, Spain); induced patterns "X is capital of Y" and "Y, whose capital is X"

Architecture
TARGET RELATION: CAPITAL_OF(X,Y)
Rank and select the best instances
- The reliability measure R(i) gives higher scores to instances that:
  - are fired by many patterns
  - have the same PoS as the seeds
  - have semantic classes similar to those of the seeds
Example: seed (Madrid, Spain); patterns "X is capital of Y" and "Y, whose capital is X"; scored instances:
  1    (Rome, Italy)
  1    (Paris, France)
  0.8  (London, England)
  0.3  (Milan, fashion)

Evaluation
- Corpus: 80 Wikipedia pages on capital cities
- Relations: Capital-of and Located-in
- Evaluation: Precision/Recall on a Gold Standard set of instances manually extracted from the corpus
- Precision close to the state of the art
- Recall can be improved using different strategies (e.g. generic patterns, feedback)
Capital-of                    Located-in
* Antananarivo ; University   district ; center
Belmopan ; Belize             * Internationals ; Amsterdam
* Open ; Masterplan           town ; province
Budapest ; Hungary            City ; Kazakihstan
* It ; America                * Exchange ; Bangkok
Hargeisa ; Somaliland         Beirut ; coastline
Honiara ; Solomon Islands     National Bank ; city
Islamabad ; Pakistan          Berlin ; Germany
* Kingston ; United States    mall ; Jakarta
Manama ; Bahrain              * It ; E

Conclusions and Future Work

Future Work
- Table Analysis: improve user interaction in the change & commit of proposed results
- Relation Extraction: use iterative algorithms to improve Recall
- Use external resources to augment the common-sense knowledge of the tool
- Develop a dedicated extension framework for hosting different linguistic modules
- Include new NLP-based ontology learning modules (e.g. NER, complex event extractor)

Thanks! Questions?

RelEx: pattern induction
- Patterns are induced from the set of input instances
- We use an induction algorithm similar to (Ravichandran and Hovy, 2002)
  - All sentences containing the input instances are retrieved
  - Sentences are parsed with the Chaos dependency parser (Basili & Zanzotto, 2002)
  - Patterns are induced from the sentences ("meaningful patterns" with respect to surface approaches)
  - Patterns are generalized to ease data sparseness (small corpora)
Example (dependencies omitted):
- Input instance: capital_of(Madrid, Spain)
- Sentence: "Madrid since 1561 is the capital of Spain"
- PATTERN INDUCTION: "X is the capital of Y", "X of Y"
- PATTERN GENERALIZATION: "X is the capital of Y", "X has been the capital of Y", "X was the capital of Y", "X of Y"
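The induction step can be approximated with a surface-string sketch. Note the swap: the slides induce patterns from dependency paths produced by a parser, whereas this illustration simply takes the text between the two seed mentions, so on "Madrid since 1561 is the capital of Spain" it would keep "since 1561" where the dependency-based method would not. Generalization is not covered. The function names and the `<X>`/`<Y>` placeholder notation are hypothetical.

```python
import re

# Assumption: a capitalised word sequence stands in for an entity mention.
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def induce_patterns(sentences, seeds):
    """Surface approximation: keep the span between the two seed mentions as a pattern."""
    patterns = set()
    for x, y in seeds:
        for sentence in sentences:
            i, j = sentence.find(x), sentence.find(y)
            if i == -1 or j == -1:
                continue  # the sentence does not mention both seed arguments
            start, end = min(i, j), max(i + len(x), j + len(y))
            span = sentence[start:end]
            patterns.add(span.replace(x, "<X>").replace(y, "<Y>"))
    return patterns


def apply_pattern(pattern, text):
    """Match an induced pattern against new text and return the (X, Y) pairs it fires on."""
    parts = re.split(r"(<X>|<Y>)", pattern)
    regex = "".join(
        f"(?P<{part[1].lower()}>{ENTITY})" if part in ("<X>", "<Y>") else re.escape(part)
        for part in parts
    )
    return {(m.group("x"), m.group("y")) for m in re.finditer(regex, text)}
```

With the seed (Madrid, Spain), the sentences "Madrid is the capital of Spain." and "Spain, whose capital is Madrid, is in Europe." give one pattern for each argument order, and each pattern can then be run over other pages to propose new candidate pairs.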

RelEx: instance ranking
- Instances are ranked according to a reliability measure R(i)
- Intuition: a reliable instance is one:
  - that is fired by many patterns
  - whose PoS are the same as those of the seed
  - in which the semantic classes of X and Y are similar to those of the seed (e.g. "New Delhi" and "Madrid" are both cities)
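The slides give only the intuition behind R(i), not its formula. The sketch below therefore assumes a plain weighted sum of the three signals, purely for illustration: the `Candidate` fields, the weights (0.4, 0.2, 0.4) and the 0.5 threshold are hypothetical choices and are not the parameters used by the authors.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    pair: tuple[str, str]
    n_patterns: int           # how many induced patterns fired on this pair
    same_pos: bool            # PoS of X and Y match those of the seed
    class_similarity: float   # 0..1, similarity of semantic classes to the seed's


def reliability(c: Candidate, total_patterns: int, weights=(0.4, 0.2, 0.4)) -> float:
    """Assumed scoring: weighted sum of pattern support, PoS match and class similarity."""
    w_pat, w_pos, w_cls = weights
    support = c.n_patterns / total_patterns if total_patterns else 0.0
    return w_pat * support + w_pos * float(c.same_pos) + w_cls * c.class_similarity


def rank(candidates, total_patterns, threshold=0.5):
    """Score every candidate, drop those below the threshold, sort best first."""
    scored = [(reliability(c, total_patterns), c.pair) for c in candidates]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

Under these assumed weights, a pair such as (Milan, fashion), fired by a single pattern and with no class similarity to the seed, falls below the threshold, while pairs that look like the seed on all three signals rise to the top.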

RelEx: evaluation setup
- CORPUS: European and Asian cities, 80 Wikipedia pages (210,000 tokens)
- RELATIONS: Capital_of(X,Y), Located_in(X,Y)
- PARAM. SET: three reliability parameters, set on a development corpus of 10 pages (values 0.05, 0.25 and 0.74)
- EVALUATION: Gold Standard of instances Igs manually extracted from the corpus
- METRICS: Precision, Relative Recall, F-measure and GS-Recall, computed from the extracted instances I and the gold-standard instances Igs
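For reference, the standard set-based definitions of precision, recall against a gold standard and F-measure can be written as a short Python function. This is a sketch of the textbook definitions, assuming I is the set of extracted instances and Igs the gold-standard set; the exact formulas used in the slides, including the one for relative recall, are not reproduced here, and the function name `evaluate` is hypothetical.

```python
def evaluate(extracted, gold):
    """Return (precision, gold-standard recall, F-measure) for two sets of instances."""
    correct = extracted & gold  # instances that are both extracted and in the gold standard
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, if three pairs are extracted, two of which appear in a gold standard of four pairs, precision is 2/3, recall is 1/2 and the F-measure is 4/7.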

RelEx: evaluation results
- Metrics variation on R(i) (graph for capital_of)
- As R(i) increases, Precision and Recall show good trends
- Precision is on par with state-of-the-art systems
- Recall is comparatively low (no use of generic patterns)
  - It should improve by using more seeds