Extraction and Indexing of Triplet-Based Knowledge Using Natural Language Processing: From Text to Information

Issues with Current Search Methods
1. Entity Placement Problem - When an entity is hashed to a location in memory, this gives no indication of the term's specificity, generality, or relationships to other entities.
2. Relationship Recognition Problem - Indexing based on term location leaves any relationships between entities presented in the text unprocessed.
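
To make the contrast concrete, here is a minimal, hypothetical Java sketch (the class and names are illustrative, not taken from the presented system): a conventional inverted index only records which documents a term occurs in, while a triplet additionally records how two entities in the same sentence are related.

    import java.util.*;

    class IndexingContrast {
        // A relationship made explicit: subject, predicate, object.
        record Triplet(String subject, String predicate, String object) {}

        public static void main(String[] args) {
            // Conventional inverted index: term -> documents containing it.
            // The entries for "Tiger Woods" and "charities" say nothing about
            // how the two terms are related within doc1.
            Map<String, Set<String>> invertedIndex = new HashMap<>();
            invertedIndex.computeIfAbsent("Tiger Woods", k -> new HashSet<>()).add("doc1");
            invertedIndex.computeIfAbsent("charities", k -> new HashSet<>()).add("doc1");

            // Triplet-based representation keeps the relationship.
            Triplet t = new Triplet("Tiger Woods", "donates to", "charities");
            System.out.println(t);
        }
    }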

Solution: Sophisticated Natural Language Processing
- Text is first parsed by our natural language processing engine to allow recognition of entities and relationships
- Entities and relationships are then stored in a manner that injects a schema and maintains relationships

Background Outline
Categories of related work:
- Systems utilizing ontologies
- Systems utilizing templates
- Systems utilizing natural language parsing
- Systems that require structured language
- Entity disambiguation systems
- Our system
Systems reviewed:
- Mikrokosmos Project
- Artequakt Project
- Message Understanding System
- SemTag and Seeker
- Attempto Controlled English
- HTML Extractor
- Semantic Knowledge Representation
- Semantic Document Summarization

Background: The Mikrokosmos Project
- Utilizes a situated ontology for in-depth domain understanding
- Limited learning of new concepts
Difference from our work:
- Our system requires no previously created ontology
- Works with any domain
K. Mahesh and S. Nirenburg. A Situated Ontology for Practical NLP. In Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing, 1995.

Background: Message Understanding System
- Extracts information based on language understanding
- Uses WordNet in addition to domain information
Difference from our work:
- No template needed
- No specific domain understanding needed
A. Bagga, J.Y. Chai, and A.W. Bierman. The Role of WordNet in the Creation of a Trainable Message Understanding System. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference.

Background: SemTag and Seeker
- Tags each entity with a proper disambiguated TAP reference
- Provides an indexing system to quickly locate entities
Difference from our work:
- We extract information regarding entities
- SemTag represents future work
S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. A. Tomlin, and J. Y. Zien. SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation. World Wide Web Conference, Budapest, Hungary, 2003.

Background: Artequakt Project
- Uses a classification ontology
- Searches the web to locate information
Difference from our work:
- No classification ontology needed
- No need to crawl web pages to extract even simple bits of information
H. Alani, S. Kim, D. Millard, M. Weal, W. Hall, P. Lewis, and N. Shadbolt. Automatic Ontology-Based Knowledge Extraction from Web Documents. IEEE Intelligent Systems, 2003.

Background: Semantic Document Summarization
- Documents are translated into a semantic graph
- The graph is then inspected to determine representative sentences to be used for summarization
Difference from our work:
- The graph used is an internal representation and does not properly represent the information
- Reduces documents to summary sentences rather than to triplet form
Jure Leskovec, Marko Grobelnik, and Natasa Milic-Frayling. Learning Sub-structures of Document Semantic Graphs for Document Summarization. In Link Analysis and Group Detection, 2004.

Background: HTML Extractor
- Uses HTML code and natural language to create a semantic graph of a web page
- Uses scrubbers to extract information
Differences from our work:
- No scrubbers needed
- Works over any text
V. Svatek, J. Braza, and V. Sklenak. Towards Triple-Based Information Extraction from Visually-Structured HTML Pages. In Poster Track of the 12th International World Wide Web Conference, Budapest, 2003.

Background: Semantic Knowledge Representation
- Natural language parsing is used to locate noun phrases in biomedical abstracts
- Noun phrases are compared against terms in a thesaurus for disambiguation
Differences from our work:
- We extract information regarding entities
- More sophisticated natural language processing
Suresh Srinivasan, Thomas C. Rindflesch, William T. Hole, Alan R. Aronson, and James G. Mork. Finding UMLS Metathesaurus Concepts in MEDLINE. Proceedings of the American Medical Informatics Association, 2002.

Background: Attempto Controlled English
- Authors are asked to represent the major information in their writings in ACE format
- This allows rapid language processing and data mining
Differences from our work:
- No secondary language needed
- Text mining and information processing work directly from the written text
Tobias Kuhn, Loic Royer, Norbert E. Fuchs, and Michael Schroeder. Improving Text Mining with Controlled Natural Language: A Case Study for Protein Interactions. In Third International Workshop on Data Integration in the Life Sciences, Hinxton, UK, 2006.

Architectural Overview

Natural Language Processing Engine Overview
- Text is first parsed by JavaNLP to create a sentence tree object
- The sentence tree object is then parsed to create triplets
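
As a rough illustration of that first step, the canonical usage pattern of the Stanford Lexicalized Parser is sketched below; the model path and exact method names depend on the parser version, so treat this as an approximation of how such a sentence tree can be produced, not as the system's actual code.

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.PTBTokenizer;
    import edu.stanford.nlp.process.Tokenizer;
    import edu.stanford.nlp.process.TokenizerFactory;
    import edu.stanford.nlp.trees.Tree;

    import java.io.StringReader;
    import java.util.List;

    class SentenceTreeDemo {
        public static void main(String[] args) {
            // Load the English PCFG grammar shipped with the parser (path varies by version).
            LexicalizedParser parser =
                LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

            // Tokenize the example sentence used on the following slides.
            TokenizerFactory<CoreLabel> factory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
            Tokenizer<CoreLabel> tokenizer =
                factory.getTokenizer(new StringReader("Tiger Woods donates to a large number of charities."));
            List<CoreLabel> tokens = tokenizer.tokenize();

            // Parse into a sentence tree and print it in Penn Treebank form,
            // comparable to the scored tree shown on the next slide.
            Tree sentenceTree = parser.apply(tokens);
            sentenceTree.pennPrint();
        }
    }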

Natural Language Parsing
It is possible to use other parsers; however, Stanford's Natural Language Parser was chosen for a number of reasons:
- Java implementation
- Log-linear time
- Older, more established code base

The Sentence Tree
(ROOT [69.474]
  (S [69.371]
    (NP [20.560] (NNP [8.264] Tiger) (NNP [9.812] Woods))
    (VP [47.672]
      (VBZ [11.074] donates)
      (PP [31.541]
        (TO [0.003] to)
        (NP [27.963]
          (NP [15.561] (DT [1.413] a) (JJ [5.475] large) (NN [5.979] number))
          (PP [11.856] (IN [0.669] of)
            (NP [10.784] (NNS [7.814] charities))))))
    (. [0.002] .)))

Parsing the Sentence Tree
1. Entity Recognition
2. Predicate-Object Recognition
3. Predicate-Object Augmentation
4. Triplet Creation
5. Pronoun Resolution
6. Triplet Filtration
7. Secondary Predicate Parsing

Parsing the Sentence Tree: Triplet Creation

Step: Entity Recognition
  Portion of parse tree inspected: (NP [20.560] (NNP [8.264] Tiger) (NNP [9.812] Woods))
  Product of parse: "Tiger Woods"

Step: Predicate-Object Recognition
  Portion of parse tree inspected: (VP [47.672] (VBZ [11.074] donates) (PP [31.541] (TO [0.003] to) (NP [27.963] (NP [15.561] (DT [1.413] a) (JJ [5.475] large) (NN [5.979] number))
  Product of parse: "Tiger Woods" "a large number"

Step: Predicate-Object Augmentation
  Portion of parse tree inspected: (PP [11.856] (IN [0.669] of) (NP [10.784] (NNS [7.814] charities))))))
  Product of parse: "Tiger Woods" "charities"
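
The slides list the stages and this worked example but not the actual traversal rules, so the following is only a simplified sketch of steps 1-4 (entity recognition through triplet creation) over a Stanford parse tree. The class name, helper methods, and heuristics (first NP under the sentence as subject, the VP's verb as predicate, the VP's first NP/PP complement as object) are assumptions, not the presented system's exact rules.

    import edu.stanford.nlp.trees.Tree;
    import java.util.*;
    import java.util.stream.Collectors;

    // Hypothetical helper; the presented system's real traversal is not shown in the slides.
    class SimpleTripletExtractor {

        record Triplet(String subject, String predicate, String object) {}

        // Join the words under a subtree, e.g. (NP (NNP Tiger) (NNP Woods)) -> "Tiger Woods".
        static String words(Tree t) {
            return t.getLeaves().stream().map(Tree::value).collect(Collectors.joining(" "));
        }

        // Depth-first search for the first subtree whose label matches one of the given tags.
        static Tree first(Tree t, Set<String> labels) {
            if (labels.contains(t.label().value())) return t;
            for (Tree child : t.children()) {
                Tree found = first(child, labels);
                if (found != null) return found;
            }
            return null;
        }

        // Steps 1-4: entity recognition, predicate/object recognition, triplet creation.
        static Triplet extract(Tree root) {
            Tree subject = first(root, Set.of("NP"));                               // step 1
            Tree vp = first(root, Set.of("VP"));
            if (subject == null || vp == null) return null;
            Tree verb = first(vp, Set.of("VB", "VBD", "VBZ", "VBP", "VBN", "VBG")); // step 2
            Tree object = first(vp, Set.of("NP", "PP"));                            // steps 2-3
            if (verb == null || object == null) return null;
            return new Triplet(words(subject), words(verb), words(object));         // step 4
        }
    }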

Triplet Storage
- Triplets are then stored in the Term Hierarchy Tree
- Composed of information in TAP and WordNet
- Ability to add other ontologies
- Lends a schema to the information extracted from text
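
The slides do not give the tree's implementation, so this is only a guessed-at sketch of what a Term Hierarchy Tree node might look like: each node is a concept drawn from TAP or WordNet, with child concepts and the triplets filed under it. All names here are illustrative.

    import java.util.*;

    class TermNode {
        final String concept;                                // e.g. "Thing", "Sports", "Golf", "Tiger Woods"
        final List<TermNode> children = new ArrayList<>();
        final List<String[]> triplets = new ArrayList<>();   // subject/predicate/object filed at this node

        TermNode(String concept) { this.concept = concept; }

        TermNode addChild(String childConcept) {
            TermNode child = new TermNode(childConcept);
            children.add(child);
            return child;
        }

        void fileTriplet(String subject, String predicate, String object) {
            triplets.add(new String[] { subject, predicate, object });
        }

        public static void main(String[] args) {
            // Rebuild the example hierarchy from the next slide and file one triplet.
            TermNode thing = new TermNode("Thing");
            TermNode golf = thing.addChild("Sports").addChild("Golf");
            TermNode tiger = golf.addChild("Tiger Woods");
            tiger.fileTriplet("Tiger Woods", "donates to", "charities");
        }
    }

Filing "Tiger Woods" under Golf, which in turn sits under Sports, is what later lets a query learn that Golf is a Sport, as described on the "use of the Tree" slide below.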

The Term Hierarchy Tree
(Figure: an example Term Hierarchy Tree with root Thing; its children include Sports and Books, with Golf and Bowling under Sports and Fiction and Nonfiction under Books; leaf entities shown include Tiger Woods, Dune, and ESPN, with Tiger Woods filed under Golf as noted on the next slide.)

What is the use of the Tree?
We are able not only to locate information directly related to the searched-for entity but also to know its relation to other entities. In the previous example, "Tiger Woods" is found under Golf; beyond this, we also get the information that Golf is a Sport.

Query Processing
- The query entered by the user is first passed to the Natural Language Parser before other processing occurs
- Simple searches are reduced to their component entities
- Complex searches are reduced to triplets, and then both the triplet and the contained entities are searched on
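
As a rough illustration of that split (the class, the method names, and the dividing rule are assumptions, not the system's actual logic): a query that yields no triplet from the parse is treated as a simple entity lookup, while one that does yield a triplet is searched both by the triplet and by its component entities. The parse itself is stubbed out here with a single hard-coded pattern.

    import java.util.*;

    class QueryProcessor {

        record Triplet(String subject, String predicate, String object) {}

        // Stand-in for the natural-language parse of the query (see the extractor sketch above).
        static Optional<Triplet> parseToTriplet(String query) {
            if (query.contains(" works with ")) {
                String[] parts = query.split(" works with ", 2);
                return Optional.of(new Triplet(parts[0], "works with", parts[1]));
            }
            return Optional.empty();   // no relation recognized: simple query
        }

        static List<String> search(String query) {
            List<String> lookups = new ArrayList<>();
            Optional<Triplet> t = parseToTriplet(query);
            if (t.isPresent()) {
                // Complex query: search the triplet itself and its component entities.
                lookups.add("triplet:" + t.get());
                lookups.add("entity:" + t.get().subject());
                lookups.add("entity:" + t.get().object());
            } else {
                // Simple query: reduce to its component entity.
                lookups.add("entity:" + query.trim());
            }
            return lookups;
        }

        public static void main(String[] args) {
            System.out.println(search("Tiger Woods"));
            System.out.println(search("Tiger Woods works with charities"));
        }
    }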

Entity and Relationship Searching
Not only entities are searched for, but also specified relations, e.g. "Tiger Woods works with charities."

How is the Query Executed?
The entity or relationship provides a "link" into the Term Hierarchy Tree.
(Figure: the query entity "Tiger Woods" provides a link into the Term Hierarchy Tree; the branches shown include Root, Sports, Books, Kids, and Golf.)

Document Storage
(Figure: Document X passes through Entity Recognition and Triplet Creation, producing triplets such as "Tiger Woods" - PGA and "Tiger Woods" - tournament, along with document metrics (Tiger Woods: 12, PGA: 5, Ping: 3); storage functions then file the results into the Term Hierarchy Tree (Root, Sports, Books, Kids, Golf).)
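
A minimal sketch of the per-document bookkeeping the diagram implies (the names are illustrative, not the system's actual storage functions): count entity mentions to build the document metrics, which are then handed, together with the extracted triplets, to the functions that file them under the matching Term Hierarchy Tree nodes.

    import java.util.*;

    class DocumentStorageSketch {
        // Document metrics: number of times each recognized entity occurs in the document.
        static Map<String, Integer> documentMetrics(List<String> recognizedEntities) {
            Map<String, Integer> counts = new HashMap<>();
            for (String entity : recognizedEntities) {
                counts.merge(entity, 1, Integer::sum);
            }
            return counts;
        }

        public static void main(String[] args) {
            // Entities recognized while parsing a document like "Document X" in the diagram.
            List<String> mentions = List.of("Tiger Woods", "PGA", "Tiger Woods", "Ping");
            System.out.println(documentMetrics(mentions)); // e.g. {Tiger Woods=2, PGA=1, Ping=1}
        }
    }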

Document Retrieval
(Figure: the query "Can Tiger Woods play Tennis?" passes through Entity Recognition and Triplet Creation, yielding the entity "Tiger Woods" and the triplet "Tiger Woods" - play - "Tennis"; retrieval functions then look these up in the Term Hierarchy Tree (Root, Sports, Books, Kids, Golf).)

Related Concepts
Term Frequency / Inverse Document Frequency (TF/IDF)
- TF/IDF's concepts are used in how the system stores documents
- This work adds the relations between entities
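
For reference, the standard TF/IDF weight the slide alludes to is sketched below; the exact variant and any smoothing used by the presented system are not stated, so this is only the textbook form.

    class TfIdf {
        // Standard form: weight(t, d) = tf(t, d) * log(N / df(t)), where tf is the term's
        // count in the document, df is the number of documents containing the term,
        // and N is the total number of documents in the collection.
        static double weight(int termFreqInDoc, int docsContainingTerm, int totalDocs) {
            return termFreqInDoc * Math.log((double) totalDocs / docsContainingTerm);
        }

        public static void main(String[] args) {
            // e.g. a term occurring 12 times in a document, in 3 of 100 documents overall.
            System.out.println(weight(12, 3, 100));
        }
    }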

Triplet Production Testing
Testing occurred in two phases:
- Expert Testing
- Inexpert Testing

Results from Expert Testing

Expert Testing Results

Inexpert Testing Results
- All triplets generated by the nine students were inspected and a set of unique triplets was determined
- This set was compared to the triplets generated by the system
- There was a 53% overlap between the two
- On average, 27% of the human-created triplets were incorrect

Addressing Inexpert Testing
The seeming decline in accuracy stems from two major causes:
- The computer system captured more triplets
- The human subjects made inferences regarding the information

Contributions
- An automated method of creating semantic information
- Capture of the relationships among entities
- Understanding of an entity's place in the "grand scheme of things"