Algorithm to populate Telecom domain OWL-DL ontology with A-box object properties derived from Technical Support Documents 1 Kouznetsov A, 2 Shoebottom.

Presentation transcript:

Algorithm to populate Telecom domain OWL-DL ontology with A-box object properties derived from Technical Support Documents
Kouznetsov A (1), Shoebottom B (2), Baker CJO (1)
(1) Department of Computer Science and Applied Statistics, University of New Brunswick, Saint John, Canada
(2) Innovatia, Inc, Saint John, Canada

Motivation: Why Ontology-Centric?
Problem: To respond to information requests in a timely manner, contact center workers need to search through many types of knowledge resources.
Challenge: Increase quality of service while decreasing contact center costs.
Solution: Use the ontology-centric platform:
– less escalation to more experienced workers
– less time spent resolving cases
– greatly reduced training time

Motivation: Why Text Mining?
Problem: Highly educated experts spend significant time populating the ontology.
Challenge: Reduce the workload.
Solution: Apply text mining, a semi-automatic method for extracting information (specifically named entities and their relations) from texts and populating a domain ontology.

Focus We are focused on the problem of accurately extracting and populating relations between the named entities and presenting them as object properties between A-box individuals in an OWL-DL ontology.

Populate A-box Object Property: Single Property
T-Box: Domain Class Man; Range Class Woman; Object Property hasSister
A-Box: Domain Instance Samuel; Range Instance Mary
Candidate: hasSister ?

Populate A-box Object Property: Multiple Properties
T-Box: Domain Class Man; Range Class Woman; Object Properties hasSister, hasMother
A-Box: Domain Instance Samuel; Range Instance Mary
Candidates: hasSister ?, hasMother ?
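The candidate enumeration suggested by the two slides above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout, the names `TBOX`, `INDIVIDUALS`, and `abox_candidates` are hypothetical, and the actual pipeline reads this information from the OWL ontology rather than from Python dictionaries.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical T-box: object property -> (domain class, range class)
TBOX: Dict[str, Tuple[str, str]] = {
    "hasSister": ("Man", "Woman"),
    "hasMother": ("Man", "Woman"),
}

# Hypothetical A-box individuals grouped by class
INDIVIDUALS: Dict[str, List[str]] = {
    "Man": ["Samuel"],
    "Woman": ["Mary"],
}


def abox_candidates(tbox: Dict[str, Tuple[str, str]],
                    individuals: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """List every (domain instance, property, range instance) triple
    that the T-box domain and range classes allow."""
    candidates = []
    for prop, (domain_cls, range_cls) in tbox.items():
        domain_instances = individuals.get(domain_cls, [])
        range_instances = individuals.get(range_cls, [])
        for d, r in product(domain_instances, range_instances):
            candidates.append((d, prop, r))
    return candidates


if __name__ == "__main__":
    for candidate in abox_candidates(TBOX, INDIVIDUALS):
        print(candidate)
```

Each triple is only a candidate; whether it should be asserted in the A-box is what the scoring described later in the talk is meant to decide.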

More complicated case
A-Box: Domain Instance Samuel; Range Instance Mary
Candidates: hasSister ?, hasMother ?, hasSameLastName ?

Methodology
- Ontology-based information retrieval applies natural language processing (NLP) to link text segments, named entities, and relations between named entities to existing ontologies.
- The algorithm leverages customized gazetteer lists, including lists specific to object property synonyms.
- A-box property candidates are scored using functions of the distance between co-occurring terms.
- A-box property prediction and population are based on these scores (thresholds, fuzzy approach).
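A minimal sketch of the co-occurrence distance idea named in the methodology. It assumes whitespace tokenization, case-insensitive synonym matching, and distance measured as the smallest difference in token positions; none of these details are stated in the slides, and the function names are hypothetical. The actual system relies on GATE gazetteers rather than this hand-rolled matching.

```python
from typing import List, Optional


def find_positions(tokens: List[str], synonyms: List[str]) -> List[int]:
    """Return the start indices where any synonym (possibly multi-word) occurs."""
    positions = set()
    for synonym in synonyms:
        words = synonym.lower().split()
        if not words:
            continue  # ignore empty synonym entries
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                positions.add(i)
    return sorted(positions)


def cooccurrence_distance(sentence: str,
                          domain_synonyms: List[str],
                          range_synonyms: List[str]) -> Optional[int]:
    """Smallest token distance between a domain synonym and a range synonym.

    Returns None when the two terms do not co-occur in the sentence.
    """
    # Crude tokenization (an assumption): strip commas and periods, split on spaces
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    domain_pos = find_positions(tokens, domain_synonyms)
    range_pos = find_positions(tokens, range_synonyms)
    if not domain_pos or not range_pos:
        return None
    return min(abs(d - r) for d in domain_pos for r in range_pos)
```

A caller would pass the domain and range synonym lists of one A-box candidate and feed the resulting distance into a scoring function.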

Main Implementation Tools
- Java
- GATE/JAPE
- OWLAPI

Semi-Automatic Ontology Populating Pipeline
[Diagram. Components: Source Documents (XML); Pre-processing; Synonyms Lists; Text Segments Separation (Sentences, Tables, Other Text Segments); Text Segments Processing; Unpopulated Ontology (OWL); Term List (Excel); Ontology Population (Named Entities, Single Relations, Multi Relations); Populated Ontology; Using Ontology (Reasoning, Visualizing, Visual Queries, Connecting Resources)]

Populating Ontology
[Diagram. Labels: Scoring Framework; Co-occurrence Based Scores Generator; Relation Framework for A-box candidates extraction; Candidate Decision Framework; Decision module; Reasoning; Ontology; Scores; Focus; Labelled Data; Tres]

Co-occurrence Based Scores Generator (light version)
[Diagram. Labels: A-box Candidate; All related content; Scores; Relations Framework; Relation Object; Tokenizer; Gazetteer; Score Calculator; Integrator; Fragments Processor; Synonyms List]

Generation of Scores
- Relation Collection: framework to process Relation objects
- A Relation object integrates an object property with:
  – all types of related text fragments
  – ontology objects
  – score processing intermediate and final results
- A Relation is identified as: Domain Class : Domain Instance : Object Property : Range Class : Range Instance

Scores Generator: Details
Score Calculator: calculates scores for the text fragments associated with a Relation.
- The current version is based on the distance between co-occurring entities and the number of text fragments with co-occurrence.
- Includes the Text Fragments Processor and the Integrator.

2-terms and 3-terms scoring system
[Diagram. Labels: Tokenizer; Gazetteer; Score Processor; Domain Synonyms list; Range Synonyms list; Object Property Synonyms list; sentence; Tokenized sentence; Score. Legend: Legacy (2 terms) system; Modified/Added in the new (3 terms) system]

Multiple Formats Score Generation
Technical documentation contains knowledge displayed in multiple formats, each requiring different processing subroutines:
- Table processing
- Sentence processing
- Other segments

Extensible Data Model
Corpus
  Document (Doc ID)
    Segment
      Table Segment
        Data Cell (ID, Content)
        Row Header (ID, Content)
        Column Header (ID, Content)
        Table Header (ID, Content)
      Text Segment
        Sentence (ID, Content)
Options: Sections, Paragraphs, Bullet lists, Headings
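One possible reading of the data model above, written as Python dataclasses. The class and field names are hypothetical, and the nesting (corpus, document, segment, then table or text segment) is inferred from the slide labels rather than taken from the authors' Java code.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Item:
    """Any identified piece of text: a data cell, a header, or a sentence."""
    id: str
    content: str


@dataclass
class TableSegment:
    """One data cell together with the headers that give it context."""
    data_cell: Item
    row_header: Item
    column_header: Item
    table_header: Item


@dataclass
class TextSegment:
    """One sentence of running text."""
    sentence: Item


# A segment is either a table segment or a text segment
Segment = Union[TableSegment, TextSegment]


@dataclass
class Document:
    doc_id: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Corpus:
    documents: List[Document] = field(default_factory=list)

    def sentences(self) -> List[Item]:
        """Collect every sentence across all documents."""
        return [s.sentence
                for d in self.documents
                for s in d.segments
                if isinstance(s, TextSegment)]
```

Further segment types (sections, paragraphs, bullet lists, headings) could be added as extra dataclasses in the `Segment` union without changing `Document` or `Corpus`.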

A-Box Property Population
[Diagram. Labels: A-Box property candidates list; Text Mining corpus; Gazetteer List; Ontology processing]
- T-Box object properties: 102
- A-Box object properties: 399
  – properties with occurrence of domain or range individuals: 256
  – properties with co-occurrence of domain and range individuals: 143

A-Box scoring: Evidence for A-box Object Property candidates
[Diagram. For the current A-box object property candidate, evidence is shown in two groups, each made up of text segments (Sentence: ID, Content) and table segments (Data Cell, Row Header, Column Header, Table Header; each with ID and Content):
- evidence with co-occurrence of domain and range
- evidence with occurrence of domain or range]

Table Segments: Primary Scoring
[Diagram. A-Box scoring of a table segment (Data Cell, Row Header, Column Header, Table Header; each with ID and Content) against the current A-box object property candidate (Domain, Property, Range)]

Table Segments: Secondary Scoring
[Diagram. A-Box scoring of a table segment (Data Cell, Row Header, Column Header, Table Header; each with ID and Content) against the current A-box object property candidate (Domain, Property, Range)]

Sentence Scoring
- A-box object property score for a sentence: SentenceScore = 1/(distance + 1) + Bonus
- Integrated object property score over all related sentences: IntegratedScore = SUM(SentenceScore)
- The integrated score is added to the table scores
- Normalized object property score: NormalizedScore = IntegratedScore / Norm

Sentence scoring
Score = 1/(distance + 1) + Bonus
Legend: D = Domain Synonym, R = Range Synonym, P = Object Property Synonym
- Type 1 (DR): Distance: 1000, Bonus = 0, Score = 1/(1000+1) + 0 ≈ 0.000999
- Type 2 (123D4R): Distance: 4, Bonus = 0, Score = 1/(4+1) + 0 = 0.2
- Type 3 (123D4R6P): Distance: 6, Bonus = 3, Score = 1/(6+1) + 3 = 3.14
- Type 4 (12PD4R): Distance: 4, Bonus = 10, Score = 1/(4+1) + 10 = 10.2
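The sentence score formula and the slide's four worked examples can be checked with a few lines of Python. The function names are hypothetical, and how the distance and bonus are obtained from a sentence is left out; only the arithmetic from the slides is reproduced.

```python
from typing import Iterable, Tuple


def sentence_score(distance: int, bonus: float = 0.0) -> float:
    """Score = 1/(distance + 1) + Bonus, as given on the slide."""
    return 1.0 / (distance + 1) + bonus


def integrated_score(evidence: Iterable[Tuple[int, float]]) -> float:
    """Sum of sentence scores over all sentences related to one candidate."""
    return sum(sentence_score(distance, bonus) for distance, bonus in evidence)


# (distance, bonus) pairs taken from the four slide examples
SLIDE_EXAMPLES = [(1000, 0), (4, 0), (6, 3), (4, 10)]

if __name__ == "__main__":
    for distance, bonus in SLIDE_EXAMPLES:
        print(distance, bonus, round(sentence_score(distance, bonus), 4))
```

Because the bonus is added rather than multiplied, a sentence that contains an object property synonym dominates any sentence that only has the domain and range close together.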

Example Sentence Type 1
Pattern (type 1): DR
Distance: 1000, Bonus = 0, Score = 1/(1000+1) + 0 ≈ 0.000999
Sentence before cleaning: [" Rotate the insert/extract levers to eject the 8660 SDM from the chassis.]
Final Score= E-4 Best Bonus=0.0 Final Distance=
Relation: Telecommunications_Chassis:8010co_Chassis:hasChassis_Shipping_Accessories:Telecommunications_Chassis_Screws:Screws
Property Synonyms: need have require has
Domain Synonyms: 8010co chassis 8010co Chassis 8010 CO chassis 8010co 8010CO chassis
Range Synonyms: Screws screws

Example Sentence Type 2
Pattern (type 2): 123D4R
Sentence after cleaning: In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.
Final Score=0.05 Best Bonus=0.0 Final Distance=19
Relation: Telecommunications_Chassis:Chassis:hasChassis_Components:Telecommunications_Chassis_Power_Supply:Power_Supply
Property Synonyms: have has
Domain Synonyms: chassis switch chassis 8000 series Chassis CO chassis
Range Synonyms: Power Supply transformer power supply power module Power supply

Example Sentence Type 4
Pattern (type 4): 12PD4R
Sentence after cleaning: In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.
Final Score=10.05 Best Bonus=10.0 Final Distance=19
Relation: Telecommunications_Chassis_Power_Supply:Power_Supply:isPart_of_Chassis:Telecommunications_Chassis:Chassis
Property Synonyms: used in include
Domain Synonyms: Power Supply transformer power supply power module Power supply
Range Synonyms: chassis switch chassis 8000 series Chassis CO chassis

Bonus Calculation
Bonus = Bonus Constant * Number of tokens in property
Patterns: 12PD4R6, 123DR6P
- Distance: 6, Bonus Constant = 10, Tokens in Property = 2, Score = 1/(6+1) + 2*10 = 20.14
- Distance: 6, Bonus Constant = 10, Tokens in Property = 1, Score = 1/(6+1) + 1*10 = 10.14
Sentence example: Device X does not support Device Y
Object Property | Number of Tokens | Obtained Score
Support         | 1                | 1/(3+1) + 1*10 = 10.25
Not Support     | 2                | 1/(3+1) + 2*10 = 20.25
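A short sketch of the bonus rule, assuming the property synonym is tokenized on whitespace and using the bonus constant of 10 from the slide. The function names are hypothetical.

```python
BONUS_CONSTANT = 10.0  # value used in the slide examples


def property_bonus(property_synonym: str,
                   bonus_constant: float = BONUS_CONSTANT) -> float:
    """Bonus = Bonus Constant * number of tokens in the matched property synonym."""
    return bonus_constant * len(property_synonym.split())


def score_with_property(distance: int, property_synonym: str) -> float:
    """Sentence score 1/(distance + 1) plus the token-count bonus."""
    return 1.0 / (distance + 1) + property_bonus(property_synonym)


if __name__ == "__main__":
    # "Device X does not support Device Y": the longer match scores higher
    print(score_with_property(3, "support"))
    print(score_with_property(3, "not support"))
```

Scaling the bonus by token count lets the longer synonym ("not support") outrank the shorter one ("support") when both match the same sentence, which matters here because the two express opposite relations.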

Normalization
Norm coefficient for an A-box object property:
Norm = Log(1.0 + (NSD + 1.0/Cd) * (NSR + 1.0/Cr))
- NSD: number of sentences in which the domain occurred
- Cd: cardinality of the domain synonyms list
- NSR: number of sentences in which the range occurred
- Cr: cardinality of the range synonyms list
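The norm coefficient can be written out directly. This sketch assumes `Log` means the natural logarithm (the slides do not state the base) and that both synonym lists are non-empty; the function names and the sample numbers in the usage lines are hypothetical.

```python
import math


def norm_coefficient(nsd: int, cd: int, nsr: int, cr: int) -> float:
    """Norm = log(1 + (NSD + 1/Cd) * (NSR + 1/Cr)).

    nsd, nsr: number of sentences in which the domain / range occurred.
    cd, cr:   cardinality of the domain / range synonym lists (must be >= 1).
    """
    if cd < 1 or cr < 1:
        raise ValueError("synonym lists must contain at least one entry")
    return math.log(1.0 + (nsd + 1.0 / cd) * (nsr + 1.0 / cr))


def normalized_score(integrated: float, nsd: int, cd: int, nsr: int, cr: int) -> float:
    """NormalizedScore = IntegratedScore / Norm."""
    return integrated / norm_coefficient(nsd, cd, nsr, cr)


if __name__ == "__main__":
    # Hypothetical counts: domain seen in 5 sentences, range in 3
    print(norm_coefficient(nsd=5, cd=5, nsr=3, cr=2))
    print(normalized_score(10.4, nsd=5, cd=5, nsr=3, cr=2))
```

Because of the 1/Cd and 1/Cr terms, the argument of the logarithm stays above 1 even when a term never occurs, so the norm is always positive and the division is safe.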

Gold Standard and Evaluation Framework
[Diagram. The ontology populating pipeline (Source Documents (XML); Pre-processing; Synonyms Lists; Text Segments Separation into Sentences, Tables, Bullet Lists; Text Segments Processing; Unpopulated Ontology (OWL); Term List (Excel); Ontology Population with Named Entities, Single Relations, Multi Relations; Populated Ontology; Using Ontology: Reasoning, Visualizing, Visual Queries, Connecting Resources) shown together with a Prediction Evaluation Framework: Populate Ontology; A-Box Ontology; T-Box Ontology; Labels; Import labels; Golden Standard Database; Evaluate predicted Properties / Update DB; Knowledge Engineer; Evaluation Report]

Thresholds: Decision Boundary
- All scores for each A-box property candidate are summed over the eligible sources of evidence for the A-box property in question
- A threshold on the summed score serves as the decision boundary
- Trade-off: Recall vs. Precision
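The recall versus precision trade-off can be illustrated with a small threshold sweep. The scored candidates below are invented toy data, not results from the talk, and the function name is hypothetical.

```python
from typing import List, Tuple


def precision_recall(scored: List[Tuple[float, bool]],
                     threshold: float) -> Tuple[float, float]:
    """Predict positive when score >= threshold, then compare with gold labels.

    scored: (summed score, gold label) per A-box property candidate.
    """
    tp = sum(1 for score, gold in scored if score >= threshold and gold)
    fp = sum(1 for score, gold in scored if score >= threshold and not gold)
    fn = sum(1 for score, gold in scored if score < threshold and gold)
    # Convention chosen here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): three true properties, two false ones
TOY_CANDIDATES = [(20.1, True), (10.2, True), (3.1, False), (0.2, True), (0.05, False)]

if __name__ == "__main__":
    for threshold in (10.0, 0.1):
        print(threshold, precision_recall(TOY_CANDIDATES, threshold))
```

On this toy data the high threshold keeps only confident candidates (higher precision, lower recall), while the low threshold admits everything with any evidence (full recall, lower precision).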

Results for Tables: Baseline result
Focus on Positive class Recall and Positive class Precision
- Class of interest: Positive class
- Recall = 0.80
- Precision = 0.85

Results for Tables: Continued
Focus on Positive class Precision
- Class of interest: Positive class
- Recall = 0.25
- Precision = 1.0

Results for Tables: Continued
Focus on Positive class Recall
- Class of interest: Positive class
- Recall = 1.0
- Precision = 0.775 (77.5%)

Results for Sentences
Focus on Positive class Precision
- Class of interest: Positive class
- Recall = 0.14
- Precision = 1.0

Results for Sentences and Tables
Focus on Positive class Precision
- Class of interest: Positive class
- Recall = 0.4
- Precision = 1.0
- Synergetic effect of using Sentences and Tables (at Precision = 1.0):
  Recall (sentences) = 0.14
  Recall (tables) = 0.25
  Recall (sentences & tables) = 0.4

Advantages
- Improve Quality of Knowledge Base
- Managing the argumentation process KB vs KE
- Iterative improvement of accuracy
- Tier 1 doing Tier 2 task (improve service)
- Tier 1 (high precision): KB query
- Tier 2 (high recall): knowledge integration
- Facilitate information processing without KE
- Reduce workload (saving)

Improve Quality of Knowledge Base
Offline task by Knowledge Engineer
- Disambiguation: the expert can pay special attention to any significant inconsistency between human and machine outputs, such as highly scored A-box candidates labeled as negatives
- Human expert and machine committee vs. single human expert

Real Time Integration of New Evidence
Online, by call centre worker, at knowledge use stage
– Extracting additional object properties from new documents for emergency cases
– High Positive Precision focused scenario
Offline, by senior call centre worker, at knowledge use stage
– Extracting additional object properties from new documents for questions not answered online
– High Positive Recall focused scenario

Reduce Workload
Online and offline: automatically extracted, evidence-backed, ranked solutions with a reported level of confidence


Future Work: Extend Literature Scheme
- Sections
- Paragraphs
- Bullet Lists
- Connect with Headings and Topics