
1 Towards an Automatic Analysis and Standardization of Unstructured Data in the Context of Big and Linked Data
Hammou FADILI (1), Christophe JOUIS (2)
(1) CEDRIC, Conservatoire National des Arts & Métiers (CNAM)
(2) University Paris Sorbonne Nouvelle – Paris III; Pierre & Marie Curie University (UPMC), LIP6, ACASA Team
This presentation concerns the project we are currently implementing, which aims to automate the conversion and integration of the unstructured data of Big Data into the LOD. This work was done by H. Fadili (CNAM) and C. Jouis (LIP6), within EC3 Labs.

2 Introduction
The World Wide Web is a dynamic, ever-changing world:
A large corpus of heterogeneous resources (Big Data)
80% of the data are in unstructured format
Corpora are not annotated, and therefore not effectively usable
Problematic
We cannot do without these unstructured data: they are often even more valuable than structured ones
This implies the need for sophisticated processes allowing the automatic analysis and exploitation of all kinds of data

3 Outline
Some keywords: (Linked, Big, Smart) data
Objectives: what we want to do and what we are doing
Motivation: related works; a comparative study of some existing and well-known solutions
Our contribution: the foundations of EC3; the application
Conclusion

4 Some keywords
Linked Open Data (LOD) is a method of publishing normalized (structured or annotated) data so that they can be interlinked and queried through semantic queries.
Big Data is characterized by large volumes of varied data, generated and shared quickly: the 3 Vs (Volume, Velocity, and Variety).
Smart Data are interpreted data: unambiguous, usable and useful, generally resulting from the semantic analysis of texts.
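The Linked Data idea above can be illustrated with a minimal sketch: facts stored as subject-predicate-object triples, interlinked across datasets and answered by pattern matching. All URIs and prefixes below (ex:, dbpedia:) are illustrative placeholders, not real LOD identifiers.

```python
# Minimal sketch of Linked Data: facts as subject-predicate-object triples,
# queryable by pattern matching. All identifiers here are illustrative only.

triples = {
    ("ex:Paris", "ex:capitalOf", "ex:France"),
    ("ex:Paris", "owl:sameAs", "dbpedia:Paris"),   # link into another dataset
    ("dbpedia:Paris", "rdf:type", "dbpedia:City"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# A "semantic query": what is ex:Paris linked to via owl:sameAs?
print(match(s="ex:Paris", p="owl:sameAs"))
```

Real LOD engines use SPARQL over RDF stores; the wildcard matcher stands in for that query mechanism.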

5 Objectives
We are developing an approach that consists in implementing a cyclic process:
Encapsulating our previous works and tools, to make them compatible with the new standards of the Semantic Web and the LOD
Exploiting the structured or annotated part of the Web (Linked Open Data: LOD) as training data
Performing semantic text analysis on the unstructured part of the Web
Extracting relevant data (Smart Data, i.e. interpreted data)
Standardizing the obtained data according to Semantic Web & LOD standards, norms, etc.
Connecting the new standardized data at the right place in the LOD
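The cyclic process above can be sketched as a chain of stages feeding back into the LOD. Every function name and the toy "analysis" below are our placeholders, not the authors' actual pipeline or API.

```python
# Hedged sketch of the cyclic process: analyze -> standardize -> connect,
# with the LOD serving both as training data and as the target of new links.
# All names and the toy extraction heuristic are placeholders.

def analyze(unstructured_text, training_data):
    """Semantic text analysis of the unstructured part of the Web.
    Toy stand-in: treat capitalized tokens as candidate entities."""
    return [w.strip(".,") for w in unstructured_text.split() if w[0].isupper()]

def standardize(smart_data):
    """Normalize extracted data toward Semantic Web / LOD conventions."""
    return [("ex:" + e, "rdf:type", "ex:Entity") for e in smart_data]

def connect(lod, new_triples):
    """Link the standardized data into the LOD (here: a plain set)."""
    lod.update(new_triples)
    return lod

lod = set()                      # stands in for the linked open data cloud
smart = analyze("Airbag systems in Paris.", training_data=lod)
lod = connect(lod, standardize(smart))
print(sorted(lod))
```

Because the enriched LOD becomes the training data for the next round of analysis, the process is cyclic.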

6 Motivation
This section aims to:
Explain again our problematic through some use cases
Motivate the proposed approach
It is composed of two parts:
Presenting some related works, via examples in the LOD, Big Data and Smart Data contexts
Presenting the issues that remain unresolved, through a comparative study

7 Motivation / Related works / Linked Open Data (LOD) context
In the context of the LOD, there are 3 categories of works:
Exploiting the LOD to analyze unstructured data
Connecting new data into the LOD
Combining the two approaches; LODifier is one of them

8 Motivation / Related works / Linked Open Data (LOD) context / LODifier
This example combines the two approaches.
The LODifier approach combines several technologies to perform semantic analysis of texts and their integration into the LOD.
It is based on named entity recognition and on the disambiguation of words against controlled vocabularies.
Its purpose is to extract entities and relationships from text, convert them into RDF, and then link them to DBpedia or WordNet RDF.
To simplify the presentation of examples, we retain here this work combining the two approaches.

9 Motivation / Related works / Big data context
In the context of Big Data, the aim is to convert unstructured data into a structured and semantically annotated format. The related works fall into 2 categories:
Those based on the analysis of texts as raw material: NLP, semantic analysis
Those based on a middle layer, such as NoSQL stores, that manage low-structured data (key-value)

10 Motivation / Comparative study
We have performed a comparative study of several academic and industrial NLP and semantic-analysis tools, by submitting ambiguous sentences or texts to them and analyzing the results.
These results show that the various solutions support almost the same basic NLP techniques in their early stages of analysis.
Beyond that, each solution implements its own technology to improve the management of semantic analysis and its integration in various fields (indexing, extraction, and search), in Big Data exploitation, and in the enrichment and/or exploitation of Linked Open Data.
For the management of semantics, most of the tools support named entity recognition; some go a little further, but without full and effective management of semantics.
It is at this level that we want to contribute.

11 Motivation / Conclusion
These elements related to the motivation helped us to better understand our problematic and to conclude that:
Analyzing the non-structured part of the Web is very difficult
Many solutions have begun to use one (normalized) part of the Web to standardize the other (non-normalized) part
Many problems remain unresolved, particularly in terms of semantic analysis
It is in this context that we are currently constructing a new approach, based on two important notions: the notion of context and the notion of semantic relations.

12 Our contribution: EC3 (Exploration Contextuelle – EC – Contextual Exploration)
First, a question to the audience: what exactly is a "sentence"?
Goal: extract pertinent information from heterogeneous texts from a given point of view, such as:
Semantic relations between named entities
Who speaks about what, and from which source? (citations, ...)
Method:
No syntactic analyses
No "universal" or domain-dependent dictionaries
Focus on "empty" (function) words, which in fact carry the real knowledge of a language

13 EC3: Foundations
Applicative and Cognitive Grammar (Desclés, Paris-Sorbonne). 3 levels of representation:
Morpho-syntactic structures: not universal, language-dependent
Applicative or predicative structures: descriptions are presented in the form of applicative expressions (operators applied to operands of different types): OPL (P a1 a2 ... an); not universal, language-dependent
Cognitive level: the meanings of linguistic units may be analyzed in the form of layouts (semantic-cognitive representations), constituting the knowledge representations associated with a given text: universal

14 EC3: Main idea
Connectors between terms are a set of language indicators used to isolate semantic relationships between terms/concepts. These indicators reflect linguistic knowledge that is independent of any particular field of knowledge.

15 EC3: architecture

16 EC3 software – KBS
The knowledge base consists of contextual exploration rules:
IF [conditions] THEN [conclusions]
[Conditions]: express the co-presence, or the explicit absence, of relevant linguistic units in the same context (part of text, proposition, paragraph, etc.)
[Conclusions]: progressive construction of representations (semantic relationships between words)
The search is performed over sequences of linguistic units.
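The IF [conditions] THEN [conclusions] scheme above can be encoded as data: each rule is a list of condition tests over a context plus a conclusion to record when they all hold. This is our illustrative encoding, not EC3's internal representation, and the toy rule below is invented for the example.

```python
# Illustrative encoding of a contextual-exploration rule:
# [conditions] test the co-presence of linguistic units in a context,
# [conclusions] record a semantic relationship. Names are ours, not EC3's.

def make_rule(name, conditions, conclusion):
    def apply(context_units):
        # fire the rule only if every condition holds in this context
        if all(cond(context_units) for cond in conditions):
            return conclusion
        return None
    apply.name = name
    return apply

# Toy rule: IF the context contains "composé" AND "de"
# THEN conclude a part/of relation.
rule = make_rule(
    "toy_part_of",
    conditions=[lambda u: "composé" in u, lambda u: "de" in u],
    conclusion="part/of",
)

print(rule(["est", "composé", "de"]))   # conditions met: relation concluded
print(rule(["est", "grand"]))           # conditions not met: no conclusion
```

A knowledge base is then simply a list of such rules applied to every context extracted from the text.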

17 EC3 software
We have implemented:
200 rules (for French)
3000 markers
Example: RULE ing48
LET x1, x2, x3, x4 be linguistic units (markers), and P a sequence of linguistic units
IF x1 is an occurrence of the verb "être" or one of the symbols [,], [.] or [-]
AND x2 is in the lists LIN3, LIN4 or LIN5 (see next slide)
AND x3 is in {[avec], [de], [par]}
AND x4 is the linguistic unit [de]
AND x1 x2 x3 x4 follow each other in P
THEN the semantic relation part/of holds in P.

18 EC3: detection of a part/of relation
« (...) Chaque système [Airbag] est composé de : - un [sac gonflable] et son générateur de gaz montés sur le volant pour le conducteur et dans la planche de bord pour le passager ; - un [boîtier électronique] (...) »
("(...) Each [airbag] system is composed of: - an [inflatable bag] and its gas generator, fitted on the steering wheel for the driver and in the dashboard for the passenger; - an [electronic unit] (...)")
Affected markers:
LIN3 = {composed, built, created, produced, made, given, delivered, ...}
LIN4 = {compound, formed, ...}
LIN5 = {derived, obtained, born, ...}
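RULE ing48 and the markers above can be rendered as a simple scan over the token sequence. This is a hedged simplification: we look for the x1 x2 x3 pattern only (the rule's x4 clause is folded into x3 here), the French marker forms and "être" inflections are our assumptions, and the marker lists are truncated.

```python
# Hedged Python rendering of RULE ing48 (simplified): detect the pattern
# x1 (form of "être" or a symbol) x2 (LIN3/LIN4/LIN5 marker) x3 (avec/de/par)
# in a token sequence P, and conclude a part/of relation at that position.
# French forms below are assumptions; the real lists are much longer.

LIN3 = {"composé", "construit", "créé", "produit", "fait", "donné", "délivré"}
LIN4 = {"formé"}
LIN5 = {"dérivé", "obtenu", "né"}

ETRE = {"est", "sont", "était", "étaient"}   # occurrences of the verb "être"
SYMBOLS = {",", ".", "-"}

def ing48(tokens):
    """Return the positions where the semantic relation part/of is detected."""
    hits = []
    for i in range(len(tokens) - 2):
        x1, x2, x3 = tokens[i], tokens[i + 1], tokens[i + 2]
        if (x1 in ETRE or x1 in SYMBOLS) \
           and x2 in LIN3 | LIN4 | LIN5 \
           and x3 in {"avec", "de", "par"}:
            hits.append(i)
    return hits

sentence = "Chaque système Airbag est composé de un sac gonflable".split()
print(ing48(sentence))   # 'est composé de' triggers the rule at position 3
```

No syntactic analysis is needed: the markers alone, in context, isolate the relation, which is the main idea of contextual exploration.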

19 Search for information from the graph
Navigating the graph / subgraph
Selection of triads [term] - (relationship) - [term] in RDF format, which allows us to connect our outputs to Linked Open Data
Return to text portions through graph <---> text links: the associated portions are the parts of the texts that allowed the construction of the selected subgraph.

20 Selection of triads [term] - (relationship) - [term]

21 Result of the query

22 Conclusions
EC3 prolongs and generalizes all the previous CE applications.
Tested on many corpora in French, English, Spanish, Arabic, Korean and Japanese.
Currently tested on very large and heterogeneous corpora, thanks to the OBVIL Labex (UPMC / Paris-Sorbonne Universities and the BnF – Bibliothèque Nationale de France).
We will test the GEPHI software as a graphical interface for the user.
