Ontology-Based Information Extraction: Current Approaches.

Internet Today: Data Explosion
- 45 GB of data per person worldwide
- 988,000,000,000,000,000,000 bytes (988 exabytes) available online in 2010
- 60% yearly growth
- 1,800,000,000,000,000,000,000 bytes (1,800 exabytes) will be available by the end of 2011 (according to IDC statistics)

Internet Today (2): Web 2.0, User-Generated Content
- By the end of 2013, 155 million users (in the US alone) will be using information created by others
- 115 million users in the US will actively create content on the Web
- The growth in data sharing is currently 15 times larger than the growth in data downloading (data from 2008)

Search
"...Search today is still kind of a hunt, where you get all these links, and as we teach software to understand the documents, really read them in the sense a human does, you will get answers more directly..." - Bill Gates

Google Search Engine
Query: "Which Nobel prize winners were born before Albert Einstein?"
Google returns 24,600,000 results:
- Albert Einstein – Biography
- Albert Einstein - Wikipedia, the free encyclopedia
- Jewish Nobel Prize Winners in Physics
- Nobel Prize Winners Hate School (Learn in Freedom!)
- HHF Factpaper: Jewish Nobel Prize Winners; Part II: Physics
Why? Because Google queries are keyword-based and do not capture the semantic connections between words.

Solution to Imprecise Data
- The best solution would be to present everything available online in a semantic way (the idea of Web 3.0, Tim Berners-Lee).
- Build semantics-aware information extraction systems.

Information Extraction
- Reduces the information in a document by transforming it into a machine-readable structure
- Is tightly connected with NLP
- Does not try to understand the input data
- Analyzes only the portions of documents that contain relevant information

New View on IE
More and more people are starting to see IE not only as a process of retrieving disconnected text tokens, but as a way of obtaining meaningful semantic data.

How Do We Get There: OBIE
An Ontology-Based Information Extraction (OBIE) system is a system that processes unstructured or semi-structured natural language text through a mechanism guided by ontologies to extract certain types of information, and presents the output using ontologies.

Yago vs. Google
Query: "Which Nobel prize winners were born before Albert Einstein?"
Yago returns 1 result *:
- Johannes_Stark (15 April 1874 – 21 June 1957) was a German physicist and Physics Nobel Prize laureate who was closely involved with the Deutsche Physik movement under the Nazi regime.
* Note that, compared to Google, Yago has a very limited knowledge base.
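
To illustrate why a fact base can answer this query directly, here is a minimal sketch over a toy set of (subject, predicate, object) triples. The predicate names `bornInYear` and `hasWonPrize` and the three-person fact base are assumptions made for this example; they are not Yago's actual schema or query interface.

```python
# Toy fact base of (subject, predicate, object) triples.
# Predicate names are hypothetical, chosen only for this illustration.
FACTS = [
    ("Albert_Einstein", "bornInYear", 1879),
    ("Albert_Einstein", "hasWonPrize", "Nobel_Prize_in_Physics"),
    ("Johannes_Stark", "bornInYear", 1874),
    ("Johannes_Stark", "hasWonPrize", "Nobel_Prize_in_Physics"),
    ("Niels_Bohr", "bornInYear", 1885),
    ("Niels_Bohr", "hasWonPrize", "Nobel_Prize_in_Physics"),
]


def objects(subject, predicate):
    """Return all objects stored for a given subject and predicate."""
    return [o for s, p, o in FACTS if s == subject and p == predicate]


def nobel_winners_born_before(person):
    """Find Nobel prize winners whose birth year precedes that of `person`."""
    cutoff = objects(person, "bornInYear")[0]
    winners = {
        s for s, p, o in FACTS
        if p == "hasWonPrize" and str(o).startswith("Nobel_Prize")
    }
    return sorted(
        w for w in winners
        if w != person and any(year < cutoff for year in objects(w, "bornInYear"))
    )


print(nobel_winners_born_before("Albert_Einstein"))
```

Because the birth year and the prize are stored as explicit relations, the comparison "born before" becomes a simple filter, which keyword matching cannot express.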

Uses of OBIE
- Automating natural language processing
- Creating semantic content for Web 3.0
- Improving the quality of ontologies

Typical OBIE Architecture

Preprocessor
The preprocessor consists of input-specific modules that transform text into a form that the extraction module can process. Preprocessing consists mainly of stripping whitespace, HTML tags, and unreadable characters.
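
A preprocessing step of this kind might look like the following sketch, which uses only the Python standard library. The function name `preprocess` and the choice to replace unreadable characters with spaces are assumptions for illustration, not part of any particular OBIE system.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text between HTML tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def preprocess(raw):
    """Strip HTML tags, unreadable characters, and redundant whitespace."""
    collector = _TextCollector()
    collector.feed(raw)
    collector.close()  # flush any text still buffered by the parser
    # Joining with a space keeps words from adjacent elements apart.
    text = " ".join(collector.chunks)
    # Replace non-printable characters (other than whitespace) with a space.
    text = "".join(ch if ch.isprintable() or ch.isspace() else " " for ch in text)
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("<p>Johannes   Stark\x07 was a <b>German</b> physicist.</p>\n"))
```

A real preprocessor would have one such module per input type (HTML, PDF, plain text), each producing the same clean text for the extraction module.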

Extraction Module
The extraction module is where the actual IE takes place: the input data is analyzed, converted into tokens that can be mapped to the ontology, and finally bound together with semantic relationships. The data produced by the extraction module must then be transformed into a specific description logic language (currently usually OWL) so that it can be saved in the knowledge base.
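
As a rough sketch of the last step, the snippet below maps recognized entities to ontology classes and writes the result as RDF type assertions in N-Triples, one of the RDF syntaxes in which OWL data can be stored. The namespace `http://example.org/obie#`, the class `Physicist`, and the gazetteer lookup are all hypothetical; a real extraction module would use far richer analysis than substring matching.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/obie#"  # hypothetical ontology namespace

# Hypothetical mapping from surface strings to ontology classes.
GAZETTEER = {
    "Johannes Stark": "Physicist",
    "Albert Einstein": "Physicist",
}


def extract_type_assertions(text):
    """Return N-Triples lines asserting the class of each known entity in text."""
    lines = []
    for surface, cls in sorted(GAZETTEER.items()):
        if surface in text:
            subject = BASE + surface.replace(" ", "_")
            lines.append(f"<{subject}> <{RDF_TYPE}> <{BASE}{cls}> .")
    return lines


for line in extract_type_assertions("Johannes Stark was a German physicist."):
    print(line)
```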

Rule Learning-Based Extraction Methods (RLBEM)
- Dictionary-Based Method (DBM): before IE begins, a dictionary of patterns is created; this dictionary is later used to extract the needed information from new, untagged text.
- Rule-Based Method (RBM): uses rules instead of dictionaries for IE.

DBM Example
Suppose we want to find information about terrorist attacks. We can use a concept that consists of the trigger word "bombed" together with the linguistic pattern passive-verb. When the DBM finds a sentence such as "The NY metro was bombed tonight", the concept is activated (the sentence contains the word "bombed"); then the linguistic pattern is matched against the sentence, and the subject (in this case "The NY metro") is extracted as the target of the terrorist attack.
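
The trigger-plus-pattern mechanism can be sketched as follows. This is a simplified illustration that approximates the passive-verb pattern with a regular expression; the concept name `target_of_attack` and the dictionary layout are assumptions, and a real DBM would rely on a syntactic parser rather than a regex.

```python
import re

# A hypothetical one-entry dictionary: a trigger word plus a passive-verb pattern
# whose subject is the piece of information to extract.
CONCEPTS = [
    {
        "name": "target_of_attack",
        "trigger": "bombed",
        "pattern": re.compile(
            r"^(?P<subject>.+?)\s+(?:was|were|is|are|has been|have been)\s+bombed\b",
            re.IGNORECASE,
        ),
    },
]


def extract(sentence):
    """Return (concept name, extracted subject) pairs for one sentence."""
    words = re.findall(r"\w+", sentence.lower())
    results = []
    for concept in CONCEPTS:
        # The concept is activated only if the trigger word occurs in the sentence.
        if concept["trigger"] not in words:
            continue
        # Once activated, the linguistic pattern decides what gets extracted.
        match = concept["pattern"].search(sentence)
        if match:
            results.append((concept["name"], match.group("subject")))
    return results


print(extract("The NY metro was bombed tonight"))
```

Checking the trigger word first is cheap, so the more expensive pattern matching runs only on sentences that are likely to be relevant.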

Classification-Based Extraction Method (CBEM)
The basic idea behind CBEM is to treat the IE problem as a classification problem. Currently, the most popular approach to the classification problem is to use Support Vector Machines (SVMs), which are supervised learning models trained on labeled examples.
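
To make the classification view concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the regularized hinge loss, written in plain Python. The two token features, the four training tokens, and the hyperparameters are assumptions chosen for illustration; practical systems use an SVM library and much richer features.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=50):
    """Sub-gradient descent on the L2-regularized hinge loss.

    samples: list of equal-length feature vectors; labels: +1 or -1.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wi * (1.0 - lr * lam) for wi in w]
            # Hinge-loss step: only examples inside the margin move the boundary.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(model, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical features per token: (is a title word, is lowercase).
X = [(1, 0), (0, 1), (1, 0), (0, 1)]  # "Professor", "will", "Doctor", "give"
y = [1, -1, 1, -1]                     # +1 = begins a speaker name, -1 = does not
model = train_linear_svm(X, y)
print(predict(model, (1, 0)), predict(model, (0, 1)))
```

The labels in `y` are what makes this supervised: the model learns the boundary from tokens whose correct class is already known.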

Classification Sample
After proper training, when given the text "Professor Marian Makuch will give a speech about dark matter", the SVM-based CBEM system should mark the "Professor" token as the beginning of the speaker label and "Makuch" as its end.
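
The target output for this sentence can be written down as one class label per token, which is the form a classifier is trained to predict. The sketch below shows that encoding together with a few simple token features; the label names, the feature set, and both function names are assumptions for illustration.

```python
def boundary_labels(tokens, span_start, span_end, field="speaker"):
    """Label each token as the begin boundary, the end boundary, or 'other'.

    span_start and span_end are inclusive token indices; this sketch assumes
    the span covers at least two tokens.
    """
    labels = []
    for i in range(len(tokens)):
        if i == span_start:
            labels.append(f"{field}_begin")
        elif i == span_end:
            labels.append(f"{field}_end")
        else:
            labels.append("other")
    return labels


def token_features(tokens, i):
    """Return a small, hypothetical feature dictionary for the token at index i."""
    return {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
    }


tokens = "Professor Marian Makuch will give a speech about dark matter".split()
labels = boundary_labels(tokens, span_start=0, span_end=2)
for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")
```

Each (features, label) pair becomes one training example, so extracting a speaker reduces to predicting the begin and end labels for the tokens of a new sentence.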

OBIE Ontology
