Published by Peter Rodgerson. Modified over 9 years ago.
1
Semantic Indexing and Search for Content Management Systems
Suat Gönül, SRDC, suat@apache.org
A. Anil Sinaci, SRDC, sinaci@apache.org
prepared by / presented by
2
About me
PhD student at
Senior Software Engineer at
FP7 Projects:
3
Agenda
Stanbol Overview
Main scenario
Two-layered architecture
Storage
Revision management
Indexing
Application of the architecture
4
Stanbol Overview
Semantic Services
Traditional CMS
Different means of interaction
5
Stanbol Design
RESTful API
OSGi Services
6
Semantic Indexing & Search Scenario
Enhanced content
Plain content
7
CMS Structure
Plain content
8
Configuring Index
9
Domain Specific Enhancements
…
<rdf:Description rdf:about="urn:enhancement-810e14ac-af3d-3310-b9bb-74233df94ef5">
0.8
Heart disease
<j.2:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-04-18T05:34:27.329Z</j.2:created>
<j.2:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.keywordextraction.engine.KeywordLinkingEngine</j.2:creator>
…
10
Semantic Indexing & Search
Enhanced content
11
Two-layered Architecture
12
Storage Layer
Plain content
Content check
CMS
Enhanced content
13
Storage Layer
Separating the Storage layer provides:
Storage of items of different granularities (ContentItem, Entity)
The opportunity to integrate with standards-based CMSes (JCR, CMIS)
14
Revision Management
IndexingSource keeps track of the revisions of stored items.
When an item is created, a corresponding revision is created as well.
When an item is updated or deleted, its existing revision is increased.
Revisions are used to synchronize the Storage and Indexing layers.
15
Revision Management - ChangeSet
IndexingSource provides the list of items changed since a given revision through the ChangeSet structure:
getChanges(revision:long, batchSize:int):ChangeSet
A ChangeSet keeps the IDs of the changed items, not the items themselves.
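The read side of this contract can be sketched in Python as follows. This is a minimal illustration, not Stanbol's implementation: the `ChangeSet` fields (`from_revision`, `to_revision`, `changed`), the plain dictionary used as a revision registry, and the rule that items sharing a revision are never split across batches are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ChangeSet:
    """IDs of changed items plus the revision range they cover (hypothetical fields)."""
    from_revision: int
    to_revision: int
    changed: FrozenSet[str]


def get_changes(revisions: Dict[str, int], revision: int, batch_size: int) -> ChangeSet:
    """Return roughly batch_size IDs whose revision is greater than `revision`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Oldest changes first, so a caller can resume from to_revision.
    pending = sorted(
        (rev, item_id) for item_id, rev in revisions.items() if rev > revision
    )
    if not pending:
        return ChangeSet(revision, revision, frozenset())
    batch = pending[:batch_size]
    last_rev = batch[-1][0]
    # Assumption: items that share the last revision stay in the same batch,
    # otherwise resuming from to_revision would skip the remaining ones.
    batch += [entry for entry in pending[batch_size:] if entry[0] == last_rev]
    return ChangeSet(batch[0][0], last_rev, frozenset(i for _, i in batch))


# Hypothetical registry: item ID -> revision of its last change.
registry = {
    "urn:contentItem.1": 1,
    "urn:contentItem.2": 2,
    "urn:contentItem.3": 3,
    "urn:contentItem.4": 3,
}
first = get_changes(registry, 0, 2)
second = get_changes(registry, first.to_revision, 1)
```

A consumer would keep calling `get_changes` with the previous `to_revision` until an empty `changed` set comes back. Only IDs travel in the result, so the items themselves are fetched separately.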
16
Revision Management - ChangeSet
An example:
After step 1:
1 : urn:contentItem.1 //added
1 : urn:contentItem.2 //added
1 : urn:contentItem.3 //added
After step 2:
1 : urn:contentItem.1
2 : urn:contentItem.2 //updated
2 : urn:contentItem.3 //removed
After step 3:
1 : urn:contentItem.1
2 : urn:contentItem.2
3 : urn:contentItem.3 //added
3 : urn:contentItem.4 //added
getChanges(0, …) => {urn:contentItem.1, urn:contentItem.2, urn:contentItem.3, urn:contentItem.4}
getChanges(2, …) => {urn:contentItem.3, urn:contentItem.4}
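The three steps above can be replayed with a small Python sketch of the write side. It assumes, as the example suggests, that every item touched in one step receives the same new revision and that a removed item keeps its registry entry. `RevisionRegistry`, `commit`, and `changed_since` are hypothetical names, not Stanbol API.

```python
class RevisionRegistry:
    """Hypothetical in-memory revision registry for illustration only."""

    def __init__(self):
        self.revision = 0    # latest revision handed out
        self.revisions = {}  # item ID -> revision of its last change

    def commit(self, *item_ids):
        """Record one step: every touched item gets the same new revision."""
        self.revision += 1
        for item_id in item_ids:
            # Added, updated, and removed items are all recorded the same way,
            # so a removal is still visible to the indexing side.
            self.revisions[item_id] = self.revision
        return self.revision

    def changed_since(self, revision):
        """IDs whose last change is newer than the given revision."""
        return {i for i, r in self.revisions.items() if r > revision}


registry = RevisionRegistry()
# Step 1: three items added.
registry.commit("urn:contentItem.1", "urn:contentItem.2", "urn:contentItem.3")
# Step 2: item 2 updated, item 3 removed.
registry.commit("urn:contentItem.2", "urn:contentItem.3")
# Step 3: item 3 added again, item 4 added.
registry.commit("urn:contentItem.3", "urn:contentItem.4")

all_changes = registry.changed_since(0)
recent_changes = registry.changed_since(2)
```

Here `all_changes` holds all four IDs and `recent_changes` holds only items 3 and 4, matching the two `getChanges` results on the slide.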
17
Revision Management - Epoch
IndexingSource keeps an epoch, which indicates a common revision for all stored items.
A change of the epoch means that the whole data source of the IndexingSource has changed.
For instance, the data source of the IndexingSource is replaced with a new DBpedia dump.
The revision registry is then repopulated for the new items.
18
Indexing Layer
19
Indexing Layer - Semantic Index
Poll changes
Notify changes
Incremental indexing:
Persists the revision of the latest indexed item and the corresponding epoch
Requests changes as of the latest persisted revision
If the epoch changes, requests changes from scratch
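The incremental indexing behaviour can be sketched in Python as below. Both classes are hypothetical stand-ins written for this illustration: `InMemorySource` plays the role of an IndexingSource, and `SemanticIndexSketch` keeps its revision and epoch in memory, where a real index would persist them.

```python
class InMemorySource:
    """Hypothetical stand-in for an IndexingSource."""

    def __init__(self, epoch, revisions):
        self.epoch = epoch
        self.revisions = dict(revisions)  # item ID -> revision

    def changes_since(self, revision):
        changed = {i for i, r in self.revisions.items() if r > revision}
        latest = max(self.revisions.values(), default=revision)
        return changed, max(latest, revision)


class SemanticIndexSketch:
    """Hypothetical index that remembers the last revision and epoch it saw."""

    def __init__(self):
        self.epoch = None
        self.revision = 0
        self.seen = set()  # IDs processed so far

    def sync(self, source):
        if source.epoch != self.epoch:
            # Epoch change: the whole data source was replaced, so forget
            # everything and request changes from scratch.
            self.seen.clear()
            self.revision = 0
            self.epoch = source.epoch
        changed, latest = source.changes_since(self.revision)
        # A real index would add, update, or delete each changed item here.
        self.seen |= changed
        self.revision = latest
        return changed


source = InMemorySource(epoch=1, revisions={"a": 1, "b": 2})
index = SemanticIndexSketch()
first_run = index.sync(source)   # everything is new
second_run = index.sync(source)  # nothing changed since revision 2
```

The same `sync` method could be driven by polling on a timer or by a change notification; the sketch leaves that choice open.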
20
Indexing Layer - Semantic Index
State of the index:
UNINIT
INDEXING
ACTIVE
REINDEXING
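The four states can be modelled as a small state machine. The state names come from the slide, but the allowed transitions in this Python sketch are an assumption made for illustration; the slide does not spell them out.

```python
from enum import Enum


class IndexState(Enum):
    UNINIT = "UNINIT"
    INDEXING = "INDEXING"
    ACTIVE = "ACTIVE"
    REINDEXING = "REINDEXING"


# Assumed transitions: a new index is built once, then alternates between
# serving queries (ACTIVE) and being rebuilt (REINDEXING).
ALLOWED = {
    IndexState.UNINIT: {IndexState.INDEXING},
    IndexState.INDEXING: {IndexState.ACTIVE},
    IndexState.ACTIVE: {IndexState.REINDEXING},
    IndexState.REINDEXING: {IndexState.ACTIVE},
}


def transition(current, target):
    """Return the new state, or raise if the move is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


state = IndexState.UNINIT
state = transition(state, IndexState.INDEXING)
state = transition(state, IndexState.ACTIVE)
```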
21
OSGi Based Design
22
Application of Two-layered Approach
The first step is interaction with an already deployed CMS.
JCRIndexingSource connects to the CMS using the standard JCR API.
It checks the content in the CMS.
CRX
23
Application of Two-layered Approach
The JCRIndexingSource wraps a CMS document as a ContentItem to be processed by Stanbol.
… Heart valves keep blood flowing in a one-way direction by opening to let the proper amount of blood flow through and then closing to prevent backflow. From the right ventricle, blood is pumped through another valve and then into the lungs, where it receives oxygen. …
24
Application of Two-layered Approach
Corresponding enhanced ContentItem
… Heart valves keep blood flowing in a one-way direction by opening to let the proper amount of blood flow through and then closing to prevent backflow. From the right ventricle, blood is pumped through another valve and then into the lungs, where it receives oxygen …
{jcr:lastModifiedBy=admin, jcr:lastModified=2012-06-08T11:16:38, jcr:mimeType=text/plain, …}
{
  "@subject": "urn:enhancement-33ccf589-9982-3cba-4c88-8281b73e1096",
  "@type": ["Enhancement", "TextAnnotation"],
  "confidence": 0.6,
  "created": "2012-06-08T09:09:10.474Z",
  "creator": "org.apache.stanbol.enhancer.engines.keywordextraction.engine.KeywordLinkingEngine",
  "end": 1529,
  "extracted-from": "urn:content-item-b38798a5-9856-485b-880b-6f8a6a0b9eb7",
  "selected-text": { "@literal": "of blood flow", "@language": "en" },
  "selection-context": { "@literal": "Heart valves keep blood flowing in a one-way...", "@language": "en" }
},
...
25
Application of Two-layered Approach
In the second step, LDPath is used to define the index configuration.
27
LDPath
Using the previous LDPath program, a Solr-based SemanticIndex is configured.
Each line in the LDPath program configures one index field:
snomed_has_finding_site = umls-skos:has_finding_site :: xsd:string;
Index field name: snomed_has_finding_site
RDF path to be processed for the current RDF resource: umls-skos:has_finding_site
Field configuration (here, the field type): xsd:string
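To make the three parts of a field rule concrete, the following Python sketch splits a rule of the simple form `name = path :: type;` into its components. It is a simplified illustration only: it does not implement the LDPath grammar (prefix declarations, functions, filters, or field options are not handled), and `parse_field_rule` is a hypothetical helper.

```python
import re

# Matches only the simple form "<field> = <path> :: <type>;".
FIELD_RULE = re.compile(
    r"^\s*(?P<field>[\w.-]+)\s*=\s*(?P<path>.+?)\s*::\s*(?P<type>[^;\s]+)\s*;\s*$"
)


def parse_field_rule(line):
    """Split one simple field rule into (field name, RDF path, field type)."""
    match = FIELD_RULE.match(line)
    if match is None:
        raise ValueError(f"not a simple field rule: {line!r}")
    return match.group("field"), match.group("path"), match.group("type")


rule = "snomed_has_finding_site = umls-skos:has_finding_site :: xsd:string;"
field, path, field_type = parse_field_rule(rule)
```

For the rule on the slide this yields the field name `snomed_has_finding_site`, the path `umls-skos:has_finding_site`, and the type `xsd:string`.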
28
Application of Two-layered Approach
Collecting additional information for recognized named entities
…
<rdf:Description rdf:about="urn:enhancement-810e14ac-af3d-3310-b9bb-74233df94ef5">
0.8
Heart disease
<j.2:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-04-18T05:34:27.329Z</j.2:created>
<j.2:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.keywordextraction.engine.KeywordLinkingEngine</j.2:creator>
…
29
Application of Two-layered Approach
The same LDPath program that was used to configure the index is used to get additional information for recognized named entities.
14O7. C0580320 161639008 T033 At risk of disease Heart disease Finding
snomed_isa = umls-skos:isa :: xsd:string;
snomed_semantic_type = snomed:semanticType :: xsd:string;
30
Application of Two-layered Approach
All semantically related knowledge is indexed along with the actual content in the customized index.
The underlying Solr index can be queried directly through its RESTful services.
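As a rough illustration of querying the index directly, the Python sketch below builds a Solr select URL with faceting turned on. The host, port, and core name are placeholders, not values from the presentation. The facet fields reuse the index field names from the LDPath rules, and `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters. The function only builds the URL and sends no request.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, core, text, facet_fields, rows=10):
    """Build a Solr select URL that asks for facet counts on the given fields."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Placeholder host and core name; the field names come from the LDPath rules.
url = build_facet_query(
    "http://localhost:8080/solr",
    "medical",
    "heart disease",
    ["snomed_isa", "snomed_semantic_type"],
)
```

Fetching `url` with any HTTP client would return the matching documents together with counts per facet value, which is the kind of data a faceted search interface displays.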
31
Application of Two-layered Approach
Faceted search interface over the index
32
Application of Two-layered Approach
All facets
36
Thank you for listening!
Questions?