Towards an Automatic Analysis and Standardization of Unstructured Data in the Context of Big and Linked Data
Hammou FADILI (1), Christophe JOUIS (2)
(1) CEDRIC, Conservatoire National des Arts & Métiers (CNAM)
(2) University Paris Sorbonne Nouvelle – Paris III; Pierre & Marie Curie University (UPMC), LIP6, ACASA Team

Presentation transcript:

Towards an Automatic Analysis and Standardization of Unstructured Data in the Context of Big and Linked Data. Hammou FADILI (1), Christophe JOUIS (2). (1) CEDRIC, Conservatoire National des Arts & Métiers (CNAM); (2) University Paris Sorbonne Nouvelle – Paris III, Pierre & Marie Curie University (UPMC), LIP6, ACASA Team. This presentation concerns the project we are currently implementing, which aims to automate the conversion and integration of the unstructured data of Big data into the LOD. This work was done by Hammou FADILI (CNAM) and Christophe JOUIS (LIP6), EC3 Labs.

Introduction. The World Wide Web is a dynamic, ever-changing world: a large corpus of heterogeneous resources (Big data), 80% of which are in unstructured format, constituting a non-annotated corpus and therefore one that cannot be used effectively. The problem is that we cannot do without these unstructured data, because they are often even more precious than structured data. This implies the need to implement sophisticated processes allowing the automatic analysis and exploitation of all kinds of data.

Outline. Some keywords: (Linked, Big, Smart) data. Objectives: what we want to do and what we are doing. Motivation: related works, and a comparative study of some existing and well-known solutions. Our contribution: the foundations of EC3, and the application. Conclusion. In this presentation we begin by recalling some keywords and introducing the principal objectives of our work, followed by the motivation of the proposed approach, in two subparts: a description of some related works, and a comparison of some existing and well-known solutions. The third part is dedicated to our contribution, in three subparts: some reminders and vocabulary, the foundations of the tool EC3, and the implemented application. The last section concludes the presentation.

Some keywords. Linked Open Data (LOD) is a method of publishing normalized (structured or annotated) data so that it can be interlinked and queried through semantic queries. Big data is characterized by large volumes of varied data, generated and shared quickly: the 3 Vs (Volume, Velocity, and Variety). Smart data are interpreted data: unambiguous, usable and useful, generally resulting from the semantic analysis of texts.

Objectives. We are developing an approach that consists in implementing a cyclic process: encapsulating our previous works and tools to be compatible with the new standards of the Semantic Web and LOD; exploiting the structured or annotated part of the Web (Linked Open Data: LOD) as training data; computing semantic text analysis on the unstructured part of the Web; extracting relevant data (smart data, i.e. interpreted data); standardizing the obtained data according to Semantic Web and LOD standards and norms; and connecting the new standardized data in the right place in the LOD.
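The cyclic process above can be sketched as a pipeline of stages. This is a minimal illustrative sketch only: every function name, the canned training example, and the namespace URI are assumptions, not the actual EC3 implementation.

```python
# Hypothetical sketch of the cyclic process described above.
# All function names and data shapes are illustrative assumptions.

def exploit_lod_as_training_data(lod_endpoint):
    """Stage 1: collect annotated (structured) examples from the LOD."""
    # A real system would query a SPARQL endpoint; we return a canned example.
    return [{"text": "Paris is the capital of France",
             "triple": ("Paris", "capitalOf", "France")}]

def semantic_text_analysis(text, training_data):
    """Stages 2-3: analyze unstructured text and extract 'smart data'."""
    # Toy heuristic: reuse a pattern seen in the training data.
    if "capital of" in text:
        subject, _, rest = text.partition(" is the capital of ")
        return (subject, "capitalOf", rest.rstrip("."))
    return None

def standardize(triple):
    """Stage 4: express the extracted data in N-Triples form."""
    s, p, o = triple
    base = "http://example.org/"  # assumed namespace
    return f"<{base}{s}> <{base}{p}> <{base}{o}> ."

def connect_to_lod(ntriple):
    """Stage 5: publish/link the standardized statement (stubbed)."""
    return {"published": ntriple}

training = exploit_lod_as_training_data("http://example.org/sparql")
triple = semantic_text_analysis("Lyon is the capital of Rhone.", training)
result = connect_to_lod(standardize(triple))
```

The point of the sketch is the cycle: the output of stage 5 enlarges the annotated part of the Web, which stage 1 then reuses as training data.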

Motivation. This section aims to explain our problematic again through some use cases and to motivate the proposed approach. It is composed of two parts: presenting some related works, via examples, in the contexts of LOD, Big data and smart data; and presenting the issues that remain unresolved so far, through a comparative study.

Motivation / Related works / Linked Open Data (LOD) context. In the context of LOD, there are 3 categories of works: those exploiting the LOD to analyze unstructured data; those connecting new data into the LOD; and those combining the two approaches. LODifier is one of the latter.

Motivation / Related works / Linked Open Data (LOD) context / LODifier. This example combines the two approaches. LODifier combines several technologies to perform semantic analysis of texts and to integrate the results into the LOD. It is based on named entity recognition and word disambiguation using controlled vocabularies. Its purpose is to extract entities and relationships from text, convert them into RDF, and then link them to DBpedia or WordNet RDF. To simplify the presentation of examples, we show here the work combining the two approaches.
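The LODifier idea described above can be illustrated in miniature: recognize named entities in a text, build RDF triples, and link them to DBpedia resource URIs. The tiny gazetteer and the `relatedTo` predicate below are illustrative assumptions; LODifier itself uses full NER, word-sense disambiguation and WordNet RDF.

```python
# Minimal illustration of the LODifier pipeline: entity recognition,
# conversion to RDF (N-Triples), and linking to DBpedia URIs.
# The gazetteer is a toy stand-in for a real NER component.

GAZETTEER = {"Berlin": "http://dbpedia.org/resource/Berlin",
             "Germany": "http://dbpedia.org/resource/Germany"}

def lodify(sentence):
    """Return N-Triples linking recognized entities via a toy relation."""
    entities = [w.strip(".,") for w in sentence.split()
                if w.strip(".,") in GAZETTEER]
    triples = []
    if len(entities) >= 2:
        s, o = GAZETTEER[entities[0]], GAZETTEER[entities[1]]
        # 'relatedTo' is a placeholder predicate, not a DBpedia property.
        triples.append(f"<{s}> <http://example.org/relatedTo> <{o}> .")
    return triples

triples = lodify("Berlin is the capital of Germany.")
```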

Motivation / Related works / Big data context. In the context of Big data, the aim is to convert unstructured data into a structured and semantically annotated format. Here the related works fall into 2 categories: those based on the analysis of texts as raw material (NLP, semantic analysis); and those based on a middle layer, such as NoSQL stores, that manages low-structured data (key-value).
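The "middle layer" category above can be pictured with a key-value store that holds raw text and is progressively enriched with annotations. A plain dict stands in for a NoSQL store here; the document shape and helper names are assumptions for illustration.

```python
# Sketch of the low-structured (key-value) middle layer mentioned above.
# A dict stands in for a NoSQL key-value store.

store = {}

def put(doc_id, raw_text, annotations=None):
    """Store a document with whatever structure is available so far."""
    store[doc_id] = {"text": raw_text, "annotations": annotations or {}}

def annotate(doc_id, key, value):
    """Progressively enrich a document as analysis results arrive."""
    store[doc_id]["annotations"][key] = value

put("doc1", "Chaque système Airbag est composé de ...")
annotate("doc1", "relation", ("sac gonflable", "part-of", "Airbag"))
```

The design choice this illustrates: the store accepts documents before any analysis has run, which is exactly why such layers manage only "low structured" data.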

Motivation / comparative study. We have performed a comparative study of several academic and industrial NLP and semantic-analysis tools, by submitting ambiguous sentences or texts and analyzing the results. These results show that the various solutions support mostly basic NLP techniques in the early stages of analysis. Beyond that, each solution implements its own technology to improve semantic analysis and its integration in various fields (indexing, extraction, and search), in Big data exploitation, and in the enrichment and/or exploitation of Linked Open Data. For the management of semantics, most of the tools support named entity recognition; some go a little further, but without full and effective management of semantics. It is at this level that we want to contribute.

Motivation / conclusion. These elements related to the motivation helped us to understand our problematic well and to conclude that: analyzing the unstructured part of the Web is very difficult; many solutions have begun to use the normalized part of the Web to standardize the non-normalized part; and many problems remain unresolved, particularly in terms of semantic analysis. It is in this context that we are currently constructing a new approach, based on two important notions: the notion of context and the notion of semantic relations.

Our contribution: EC3 (Exploration Contextuelle – EC – Contextual Exploration). First, a question to the audience: what exactly is a "sentence"? Goal: extract pertinent information from heterogeneous texts from a given point of view, such as semantic relations between named entities, or who speaks about what, and from which source (citations, …). Method: no syntactic analysis, no "universal" or domain-dependent dictionaries; focus on the "empty" (function) words, which in fact carry the real knowledge of a language.

EC3: Foundations. Applicative and Cognitive Grammar (Desclés, Paris-Sorbonne). 3 levels of representation: morpho-syntactic structures, which are not universal and depend on the language; applicative or predicative structures, where descriptions are presented as applicative expressions (operators applied to operands of different types), OPL (P a1 a2 … an), also not universal and language-dependent; and the cognitive level, where the meanings of linguistic units are analyzed in the form of schemas (semantic-cognitive representations) constituting the knowledge representations associated with a given text; this level is universal.

EC3: Main Idea. Connectors between terms are a set of language indicators used to isolate semantic relationships between terms/concepts. These indicators reflect linguistic knowledge independent of any particular field of knowledge.

EC3 : architecture

EC3 software – KBS. The knowledge base consists of contextual exploration rules: IF [conditions] THEN [conclusions]. [Conditions] express the co-presence, or the explicit absence, of relevant linguistic units in the same context (part of text, proposition, paragraph, etc.). [Conclusions] are progressive constructions of representations (semantic relationships between words). The search operates over sequences of linguistic units.

EC3 Software. We have implemented 200 rules (for French) and 3000 markers. Example: RULE ing48. LET x1, x2, x3, x4 be linguistic units (markers) and P a sequence of linguistic units. IF x1 is an occurrence of the verb "être" or one of the symbols [,], [.] or [-], AND x2 is in list LIN3, LIN4 or LIN5 (see next slide), AND x3 is in {[avec], [de], [par]}, AND x4 is the linguistic unit [de], AND x1 x2 x3 x4 follow one another in P, THEN the semantic relation part/of holds in P.
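Rule ing48 above can be sketched as a pattern match over a token sequence. The marker lists below are abbreviated from the slides, and the whitespace tokenization and matching strategy are simplifying assumptions; the real EC3 knowledge base holds about 200 rules and 3000 markers.

```python
# Hedged sketch of contextual-exploration rule ing48:
# IF x1 x2 x3 x4 occur in sequence in P, with x1 a form of "être" or a
# punctuation mark, x2 a LIN3/LIN4/LIN5 marker, x3 in {avec, de, par},
# and x4 the unit "de", THEN conclude a part/of relation.
# Marker lists are abbreviated and illustrative.

X1 = {"est", "être", ",", ".", "-"}
LIN345 = {"composé", "constitué", "formé", "obtenu", "dérivé"}
X3 = {"avec", "de", "par"}

def rule_ing48(tokens):
    """Return True if the part/of pattern x1 x2 x3 x4 appears in tokens."""
    for i in range(len(tokens) - 3):
        x1, x2, x3, x4 = tokens[i:i + 4]
        if x1 in X1 and x2 in LIN345 and x3 in X3 and x4 == "de":
            return True
    return False

# "est obtenu par de ..." matches the x1 x2 x3 x4 pattern.
tokens = "le gaz est obtenu par de la combustion".split()
```

Note how the rule never needs a parse tree: it fires purely on the co-presence and order of surface markers, which is the core of the contextual exploration method.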

EC3: detection of a part/of relation. « (…) Chaque système [Airbag] est composé de : - un [sac gonflable] et son générateur de gaz montés sur le volant pour le conducteur et dans la planche de bord pour le passager ; - un [boîtier électronique] (…) » ("(…) Each [airbag] system is composed of: - an [inflatable bag] and its gas generator, fitted on the steering wheel for the driver and in the dashboard for the passenger; - an [electronic unit] (…)"). Matched markers: LIN3 = {composed, built, created, produced, made, given, delivered, …}; LIN4 = {compound, formed, …}; LIN5 = {derived, obtained, born, …}.

Search for information from the graph. Navigating the graph/sub-graph; selection of triples [term] – (relationship) – [term] in RDF format, which allows us to connect our outputs to Linked Open Data; and return to text portions through graph <-> text links, i.e. the parts of the texts that allowed the construction of the selected sub-graph.
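The graph <-> text linking described above can be sketched by storing each extracted triple together with the text span that justified it, so that selecting a sub-graph can lead back to the source passages. The data structures, the `select` and `to_ntriples` helpers, and the namespace URI are illustrative assumptions.

```python
# Illustrative sketch: keep [term]-(relation)-[term] triples together with
# the text portion that produced them, so a graph selection can be mapped
# back to the source passage and serialized toward the LOD.

triples = [
    {"triple": ("système Airbag", "part/of", "sac gonflable"),
     "span": "Chaque système Airbag est composé de : un sac gonflable ..."},
    {"triple": ("système Airbag", "part/of", "boîtier électronique"),
     "span": "... - un boîtier électronique ..."},
]

def select(term):
    """Select all triples mentioning a term (a tiny sub-graph query)."""
    return [t for t in triples if term in t["triple"]]

def to_ntriples(entry, base="http://example.org/"):
    """Serialize a selected triple in N-Triples form for LOD linking."""
    s, p, o = (x.replace(" ", "_").replace("/", "-") for x in entry["triple"])
    return f"<{base}{s}> <{base}{p}> <{base}{o}> ."

hits = select("système Airbag")
```

Each selected entry keeps its `span`, which is what makes the "return to text portions" step in the slide possible.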

Selection of triples [term] – (relationship) – [term]

Result of the query

Conclusions. EC3 extends and generalizes all the previous CE applications. It has been tested on many corpora in French, English, Spanish, Arabic, Korean and Japanese, and is currently being tested on very large and heterogeneous corpora thanks to the OBVIL Labex (UPMC/Paris-Sorbonne Universities and the BNF – Bibliothèque Nationale de France). We will test the GEPHI software as a graphical interface for the user.