1
Aramis Rodríguez Blanco
Extraction of Linked Data from unstructured information applying NLP techniques and ontologies
Authors: Aramis Rodríguez Blanco, Alfredo Simón Cuevas, Wenny Hojas Mazo, Jose M. Perea Ortega
2
Introduction
Rise of smart applications: semantic search.
Popularity of web technologies.
Web content reorganized as semantic data: from a Web focused on documents to a Web focused on interconnected data (Linked Data).
Uses: search, analysis, knowledge engineering, integration, information retrieval.
3
Problem
Linked Data is extracted from:
Semi-structured sources: Wikipedia (DBpedia).
Structured sources: databases.
Unstructured sources: text, where approaches target English and face information extraction limitations.
4
Proposed approach
Extraction of Linked Data from unstructured text in English and Spanish, based on automatically extracting the conceptualization of the text as a concept map, which is then transformed into the RDF data model.
(Diagram labels: Concept Map, Ontology.)
5
Proposed approach phases
Pre-processing (NLP techniques).
Extraction of conceptualization (NLP techniques and ontologies).
Generation of RDF model (rules based on patterns).
(Diagram labels: HTML, Text; Pre-processing: plain text extraction, language recognition, segmentation, POS tagging, morpho-syntactic analysis; Conceptualization Extraction: OWL ontology, concepts extraction, links extraction, normalization; RDF Data Model Generation: RDF triples.)
6
Pre-processing
Text extraction from documents (pdf, txt, doc, html).
Text segmentation (by sentences).
Language recognition (English or Spanish).
Syntactic analysis of text (using the FreeLing parser):
Token segmentation (separate words).
Morphosyntactic analysis (lemma extraction and POS tagging).
Shallow parsing (grouping tokens into noun groups, adjective groups, verb groups, etc.).
Deep parsing (dependency tree).
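The first pre-processing steps can be illustrated with a minimal sketch. It is not the FreeLing pipeline used in the slides: sentence splitting, tokenization and language recognition are approximated here with regular expressions and a tiny, hypothetical stopword list, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword lists (an assumption for this sketch);
# a real system would rely on a proper language identifier.
STOPWORDS = {
    "en": {"the", "of", "and", "is", "a", "in", "to", "with"},
    "es": {"el", "la", "de", "y", "es", "una", "en", "los", "con"},
}

def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Token segmentation: lowercase words, punctuation dropped."""
    return re.findall(r"\w+", sentence.lower())

def guess_language(text):
    """Pick the language whose stopwords occur most often in the text."""
    tokens = tokenize(text)
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)
```

For example, `guess_language("La bronquitis crónica es una enfermedad de los bronquios")` selects `"es"` because five of its tokens are in the Spanish list and none are in the English one.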
7
Conceptualization extraction
Concept extraction
Mapping of text terms to classes or instances in the OWL ontology.
Use of predefined patterns applied to noun or adjective groups (patterns were defined from the DBpedia ontology).
Pattern examples:
Pattern     Example
NN          processes
Z NN        Five files
NNP         Ernest Hemingway
JJ          impressive
JJ NN       Informatics system
JJ JJ NN    Operative logical processor
NN NNP      Chief Alfred
Z JJ NN     Three efficient methods
Legend: NN: common noun; NNP: proper noun; Z: number or numeral; JJ: adjective
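As a rough sketch of how such POS patterns could be matched, the code below scans a list of already tagged tokens and takes the longest pattern that fits at each position. The tag sequences come from the slide; the greedy longest-match strategy, the flat token list (instead of the noun and adjective groups produced by shallow parsing) and the name `extract_concepts` are assumptions.

```python
# POS-tag patterns listed on the slide (NN, NNP, Z, JJ).
CONCEPT_PATTERNS = [
    ("NN",), ("Z", "NN"), ("NNP",), ("JJ",), ("JJ", "NN"),
    ("JJ", "JJ", "NN"), ("NN", "NNP"), ("Z", "JJ", "NN"),
]

def extract_concepts(tagged):
    """Greedy left-to-right scan over (word, tag) pairs.

    At each position the longest matching pattern wins; tokens that start
    no pattern (e.g. verbs) are skipped.
    """
    patterns = sorted(CONCEPT_PATTERNS, key=len, reverse=True)
    concepts, i = [], 0
    while i < len(tagged):
        for pat in patterns:
            window = tagged[i:i + len(pat)]
            if tuple(tag for _, tag in window) == pat:
                concepts.append(" ".join(word for word, _ in window))
                i += len(pat)
                break
        else:
            i += 1  # no pattern starts here
    return concepts
```

Given `[("Three", "Z"), ("efficient", "JJ"), ("methods", "NN"), ("process", "VB"), ("five", "Z"), ("files", "NN")]`, the scan matches `Z JJ NN` first, skips the verb, and then matches `Z NN`.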
8
Conceptualization extraction
Relations extraction
Identification of explicit relations in the dependency tree.
Identification of concept nodes and linking phrases.
Identification of different kinds of relations (mainly verbal and prepositional).
Pattern examples:
Pattern                                         Example
SN0 ‘such as’ {SN1,SN2…,(and|or)} SNn           …animals such as dog and bird.
SN1 ‘is a kind of’ SN0                          concept map is a kind of concept graph
SN1 ‘is a’ SN0                                  dolphins are smart animals
SN {,} ‘has’ {SN,}+{and|or} SN                  …humans and monkeys have similar characteristics
Legend: SN: noun group; +: concatenation of elements; |: disjunction of elements; { }: optional elements; ( ): group of elements
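The ‘such as’ pattern can be approximated on raw text with a regular expression. This is only a sketch: the slides identify relations on the dependency tree and on noun groups, whereas this version treats single words as noun groups, and the function name `such_as_relations` and the (source, linking phrase, target) tuple layout are assumptions.

```python
import re

def such_as_relations(sentence):
    """Return (SN0, 'such as', SNi) propositions for 'SN0 such as SN1, SN2 and SNn'."""
    m = re.search(r"(\w+) such as ([\w ,]+)", sentence)
    if not m:
        return []
    source = m.group(1)
    # Split the enumeration on commas (optionally followed by and/or)
    # or on a bare "and"/"or" between items.
    items = re.split(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+", m.group(2))
    return [(source, "such as", item.strip()) for item in items if item.strip()]
```

Applied to the slide example "animals such as dog and bird.", this yields one proposition linking "animals" to "dog" and one linking "animals" to "bird".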
9
Conceptualization extraction
Relations extraction
Identification of implicit relations through relations in the OWL ontology.
Identification of a taxonomic relation between concepts C0 and C1 if both are ontological classes.
Identification of other semantic relations if C0 and C1 are related by an object property.
Identification of class-instance relations.
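A toy version of the ontology lookup might look as follows. The ontology here is a set of hypothetical in-memory dictionaries rather than an OWL file, and the names (`SUBCLASS_OF`, `OBJECT_PROPERTIES`, `INSTANCE_OF`, `implicit_relation`) and their contents are invented for illustration; only the three kinds of relation checked come from the slide.

```python
# Hypothetical toy ontology (assumption); a real system would load OWL.
SUBCLASS_OF = {"dolphin": "mammal", "mammal": "animal"}
OBJECT_PROPERTIES = {("person", "city"): "livesIn"}
INSTANCE_OF = {"Havana": "city"}

def is_subclass(c0, c1):
    """True if c1 is reachable from c0 by following subclass links upward."""
    seen = set()
    while c0 in SUBCLASS_OF and c0 not in seen:
        seen.add(c0)
        c0 = SUBCLASS_OF[c0]
        if c0 == c1:
            return True
    return False

def implicit_relation(c0, c1):
    """Return a (source, relation, target) tuple found in the ontology, or None."""
    if is_subclass(c0, c1):                      # taxonomic relation
        return (c0, "subClassOf", c1)
    if (c0, c1) in OBJECT_PROPERTIES:            # object property relation
        return (c0, OBJECT_PROPERTIES[(c0, c1)], c1)
    if INSTANCE_OF.get(c0) == c1:                # class-instance relation
        return (c0, "type", c1)
    return None
```

With this toy data, `implicit_relation("dolphin", "animal")` finds a taxonomic relation through the intermediate class "mammal", even though the two concepts are never linked directly in the text.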
10
RDF Model Generation
Every relation is encoded as a triple in the RDF data model. The source concept is encoded as the subject, the target concept as the object, and the linking phrase as the predicate.
URIs in triples are built from URLs of the processed HTML file: the URL of the file itself, as well as those of hyperlinks related to terms represented as concepts.
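The mapping from propositions to triples can be sketched by serializing them as N-Triples lines. The URI scheme used here (document URL plus a fragment made from the normalized label) is an assumption chosen for illustration; the slides only state that URIs derive from the URL of the processed file and from hyperlinks attached to terms. `concept_uri` and `to_ntriples` are hypothetical helper names.

```python
from urllib.parse import quote

def concept_uri(base_url, label):
    """Build a URI from the document URL and a normalized label (assumed scheme)."""
    slug = "_".join(label.lower().split())
    return f"{base_url}#{quote(slug)}"

def to_ntriples(propositions, base_url):
    """Encode (source, linking phrase, target) as subject, predicate, object."""
    lines = []
    for source, link, target in propositions:
        s = concept_uri(base_url, source)
        p = concept_uri(base_url, link)
        o = concept_uri(base_url, target)
        lines.append(f"<{s}> <{p}> <{o}> .")
    return "\n".join(lines)
```

Calling `to_ntriples([("concept map", "is a kind of", "concept graph")], "http://example.org/doc.html")` produces a single line whose subject ends in `#concept_map`, whose predicate ends in `#is_a_kind_of`, and whose object ends in `#concept_graph`.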
11
Results and discussion
Evaluation focused on the amount and quality of the information extracted from text (concepts and relations), with and without the ontology.
Abstracts from the corpora DBpedia_ES and DBpedia_EN, and IA.
Characterization of test collections:
Characteristics     DBpedia_ES   DBpedia_EN   IA
Language            Spanish, English (only two values given)
Documents           50, 6 (only two values given)
Word average        75.9         75.2         1111.83
Sentence average    3.0          3.4          46.83
12
Results and discussion
Tests were run without and with the ontological model. DBpedia_EN and IA were tested with the available ontology.
Evaluation was carried out by human evaluators, considering aspects such as: amount of information extracted, contribution of the ontology, and accuracy of the extracted concepts and relations.
Correct concept: a noun or adjective with important meaning in the text, an entity name, or a meaningful phrase.
Correct proposition: one that can be interpreted on its own (e.g. both concepts are correct and the link is well defined).
13
Results and discussion
From the information provided by the evaluators, accuracy was calculated as the ratio of correctly extracted items to the total extracted.
Aspects   DBpedia_ES   DBpedia_EN NoOnt   DBpedia_EN YesOnt   IA NoOnt   IA YesOnt
AC        7.88         10.06              10.2                112.4      120.2
AE        2.2          4.86               5.08                29.2       36.2
AR        6.10         8.38               8.48                103        114.8
PC        93.66        89.21              92.58               96.81      98.57
PR        53.92        51.26              53.97               66.67      83.10
ACOnt.    -, 3.72, 37.2 (values as listed in the source; column positions not recoverable)
AROnt.    8.6 (value as listed in the source; column position not recoverable)
Legend: AC: Average amount of concepts; AE: Average amount of entities; AR: Average amount of relations; ACOnt.: Average amount of concepts obtained from the ontology; AROnt.: Average amount of relations identified in the ontology; PC: Precision in concept identification; PR: Precision in relation identification; NoOnt: not using the ontology; YesOnt: using the ontology.
14
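Use case
To exemplify the method, a text abstract from DBpedia_ES is selected. Then the concept map is built, and finally the RDF model.
La bronquitis crónica es una enfermedad inflamatoria de los bronquios respiratorios asociada con exposición prolongada a irritantes respiratorios no específicos, incluyendo microorganismos y acompañado por hipersecreción de moco y ciertas alteraciones estructurales en el bronquio, tales como fibrosis, descamación celular, hiperplasia de la musculatura lisa, etc.
(English: Chronic bronchitis is an inflammatory disease of the respiratory bronchi associated with prolonged exposure to non-specific respiratory irritants, including microorganisms, and accompanied by mucus hypersecretion and certain structural alterations in the bronchus, such as fibrosis, cellular desquamation, hyperplasia of the smooth muscle, etc.)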
15
Use case
Concept map created automatically from the previous sentence:
16
Use case
RDF model obtained automatically by the method:
<rdf:RDF xmlns:rdf=" xmlns:owl=" xmlns:rdfs=" xmlns:daml=" >
<rdf:Description rdf:about="owl:hiperplasia_de_musculatura_liso">
  <rdfs:subClassOf rdf:resource="owl:alteración_estructural"/>
  <rdf:type rdf:resource="
</rdf:Description>
<rdf:Description rdf:about="owl:irritante_respiratorio_específico">
<rdf:Description rdf:about="owl:bronquitis_crónico">
  <rdfs:subClassOf rdf:resource="owl:enfermedad_inflamatorio_de_bronquio_respiratorio"/>
<rdf:Description rdf:about="owl:microorganismo">
  <rdfs:subClassOf rdf:resource="owl:irritante_respiratorio_específico"/>
<rdf:Description rdf:about="owl:fibrosis">
<rdf:Description rdf:about="owl:descamación_celular">
<rdf:Description rdf:about="owl:enfermedad_inflamatorio_de_bronquio_respiratorio">
<rdf:Description rdf:about="owl:alteración_estructural">
</rdf:RDF>
18
Conclusions
A method was presented to extract Linked Data from unstructured texts in English and Spanish that is not domain dependent.
NLP techniques for text processing and information extraction are combined, which allows more information to be obtained.
Linguistic patterns are used to identify concepts and the semantic relations between them, together with dependency parsing, which has led to gains in text coverage.
The extracted concepts, including entities, and the semantic links identified between them enrich the Linked Data.
19
Conclusions
Tests show promising results, with precision over 90% in concept identification and over 50% in relation identification.
The benefits of using an ontological reference model are shown by the increase in the amount of information extracted and the improvement in accuracy.