Published by Sibyl Gardner. Modified over 9 years ago.
1
Achieving Semantic Interoperability at the World Bank
Designing the Information Architecture and Programmatically Processing Information
Denise Bedford
June 28, 2005
2
Presentation Overview
► What semantic interoperability means to us
   - Multiple dimensions: functional dimension, domain dimension
   - Multiple challenges: semantics, syntax, data structure
► Strategy for achieving semantic interoperability (SI)
   - Ensure that the tools and technologies are used in the most effective and efficient way
   - Ensure that domain experts are involved in the work
   - Maintain an open and highly flexible information environment
   - Build information quality (IQ) standards into reference structures which can be leveraged across many applications and all kinds of content
► Practice and current status of work
3
Functional Dimension
► Higher-level information access functionality
   - Recommender systems, content syndication, personalization, visualization, etc., focused on user interests and content regardless of the system context
► Enterprise search and browse
   - Across systems, across all types of content, across languages
   - Functional requirements for enterprise search
► Enterprise publishing from any system to any other system
   - Without redundancy of content, while maintaining respect for intellectual attribution, records management compliance, disclosure compliance, …
► Enterprise content creation & management
   - Implementing semantic interoperability and information quality standards from the point when an object is born digital and throughout its life cycle
► Publishing and sharing our domain-based semantic networks with others working in the same domain to support collaboration
4
Strategy
► Begin with the conceptual modeling task (Erwin models)
► Work at the attribute level (attribute reference maps, specifications)
► Identify and reconcile the syntax problems (among the biggest challenges initially)
► Address the semantic problems from a master and reference data store perspective
► Build the enterprise metadata repository and enterprise search system
► Establish governance processes at the attribute level: each attribute has a different type of behavior and a different steward, and governance follows behavior
5
SI Practice Level
► Key to success at the practical level is understanding how to use the technologies to greatest advantage
► UML to model entities and relationships
   - You have to start with a baseline idea of entities and relationships, which you refine over time
   - In addition, you must have an information architecture to frame your SI solutions
   - Without the information architecture framework you’re just doing testing and exploration
► Concept and entity extraction to:
   - Form base domain vocabularies (both entity and relation vocabularies for domains)
   - Help scope and define the boundaries of the domain
   - Define pattern matching rules for some types of entities
► ‘Seeded clustering’ to understand and build semantic relationships among concepts and entities within a well-defined domain
► Categorization tools to define concept-level profiles for domains in order to programmatically classify content to domains
► Summarization and gisting technologies to support human relevance judgments and publishing
6
Smart Use of Technologies
► Sample structure: Oracle data classes used to represent the Topic Classification scheme
   - Hierarchical taxonomy as reference source for the attribute Topic
   - Used for Browse, Search, Content Syndication, Personalization
► 1st challenge is to architect the hierarchy correctly
   - 3 distinct data classes, not a tree structure with inheritance
   - Allows you to use the three data classes for distinct functions across systems but still enforce relationships across the classes
► Example: Topic, Subtopic, Subsubtopic structure in Oracle
7
3 Oracle Data classes
8
Topic data class
9
Subtopic Data Class
10
Subsubtopic Data class
11
Relationships across data classes
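
The original slides show the three data classes only as screenshots. A minimal sketch of the same idea follows, using Python's built-in sqlite3 module as a stand-in for Oracle. The table names, column names, and sample rows are assumptions made for illustration, not the Bank's actual schema. Each level is its own table, so any one of them can serve as a reference source on its own, while foreign keys enforce the relationships across the classes.

```python
import sqlite3


def build_topic_schema() -> sqlite3.Connection:
    """Create three separate data classes linked by foreign keys (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(
        """
        CREATE TABLE topic (
            topic_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL UNIQUE
        );
        CREATE TABLE subtopic (
            subtopic_id INTEGER PRIMARY KEY,
            topic_id    INTEGER NOT NULL REFERENCES topic(topic_id),
            name        TEXT NOT NULL
        );
        CREATE TABLE subsubtopic (
            subsubtopic_id INTEGER PRIMARY KEY,
            subtopic_id    INTEGER NOT NULL REFERENCES subtopic(subtopic_id),
            name           TEXT NOT NULL
        );
        """
    )
    return conn


def load_sample(conn: sqlite3.Connection) -> None:
    """Insert one illustrative row per level (sample values are made up)."""
    conn.execute("INSERT INTO topic VALUES (1, 'Education')")
    conn.execute("INSERT INTO subtopic VALUES (10, 1, 'Primary Education')")
    conn.execute("INSERT INTO subsubtopic VALUES (100, 10, 'Teacher Training')")


def full_path(conn: sqlite3.Connection, subsubtopic_id: int):
    """Join the three classes to recover the full Topic > Subtopic > Subsubtopic path."""
    return conn.execute(
        """
        SELECT t.name, s.name, ss.name
        FROM subsubtopic ss
        JOIN subtopic s ON s.subtopic_id = ss.subtopic_id
        JOIN topic t    ON t.topic_id = s.topic_id
        WHERE ss.subsubtopic_id = ?
        """,
        (subsubtopic_id,),
    ).fetchone()


def orphan_subtopic_rejected(conn: sqlite3.Connection) -> bool:
    """Return True if a subtopic pointing at a nonexistent topic is refused."""
    try:
        conn.execute("INSERT INTO subtopic VALUES (11, 999, 'Orphan')")
    except sqlite3.IntegrityError:
        return True
    return False
```

A system that only needs subtopics (for example, as knowledge domains for categorization) can read the subtopic table directly, while browse functions can join all three.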
12
Leveraging the Structure
► Each subtopic is a knowledge domain
► Each subtopic has an extensive concept-level definition (1,000 to 5,000+ concepts)
► Concepts are controlled vocabularies in their raw form
► Concepts with relationships (extensive per the new Z39.19 standard) comprise a semantic network
► Categorization tools work with the topic structure & concept definitions to categorize and index content
► The following screen shot illustrates how that same structure is embedded into a Teragram profile to support categorization
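
As a rough illustration of how subtopics plus concept definitions can drive categorization, the sketch below scores a document against small concept lists and returns the matching subtopics. It is a deliberately simple stand-in, not the Teragram profile format or algorithm; the subtopic names and concepts are invented, and real profiles would hold thousands of concepts per subtopic.

```python
import re

# Hypothetical concept-level profiles; each subtopic is treated as a knowledge domain.
PROFILES = {
    "Primary Education": {"primary school", "enrollment", "teacher training"},
    "Water Supply": {"water supply", "sanitation", "irrigation"},
}


def categorize(text, profiles=PROFILES, min_hits=1):
    """Return subtopics whose concepts occur in the text, best match first."""
    lowered = text.lower()
    scores = {}
    for subtopic, concepts in profiles.items():
        # Count how many distinct concepts of this subtopic appear as whole phrases.
        hits = sum(
            1
            for concept in concepts
            if re.search(r"\b" + re.escape(concept) + r"\b", lowered)
        )
        if hits >= min_hits:
            scores[subtopic] = hits
    # Sort by descending hit count, then alphabetically for a stable order.
    return sorted(scores, key=lambda name: (-scores[name], name))
```

Raising min_hits is one simple way to trade recall for precision when a single concept match is too weak a signal.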
13
Subtopics Domain concepts
14
Extensive operators allow us to write grammatical rules to manage typical semantic problems
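
The slide's screenshot is not reproduced here, so the following is only a sketch of the kind of rule such operators make possible, written in plain Python rather than in the tool's own rule syntax (which is not shown in the source). It assumes a hypothetical disambiguation problem: deciding whether "bank" is used in a financial sense, using AND, NOT, and a proximity (NEAR) test. The cue words and window sizes are arbitrary choices for illustration.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(toks, a, b, window):
    """True if tokens a and b occur within `window` positions of each other."""
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def is_financial_bank(text):
    """Rule: 'bank' AND NEAR(finance cue) AND NOT NEAR('river')."""
    toks = tokens(text)
    has_bank = "bank" in toks
    finance_sense = any(
        near(toks, "bank", cue, 5) for cue in ("loan", "credit", "lending")
    )
    river_sense = near(toks, "bank", "river", 3)
    return has_bank and finance_sense and not river_sense
```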
15
Concept based rules engine allows us to define patterns to capture other kinds of data
16
Example of use of Authority Control to capture country names but extract ‘authorized’ version of country name
Example of use of a gazetteer + concept extraction + rules engine to support semantic interoperability
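
A minimal sketch of authority control for country names: variant spellings found in text are mapped to a single authorized form. The variant lists below are illustrative assumptions, not the Bank's actual authority file or gazetteer.

```python
import re

# Hypothetical authority file: authorized name -> variants that may appear in text.
AUTHORITY = {
    "Côte d'Ivoire": ["côte d'ivoire", "cote d'ivoire", "ivory coast"],
    "Kyrgyz Republic": ["kyrgyz republic", "kyrgyzstan", "kirghizia"],
}


def build_variant_index(authority):
    """Invert the authority file so each variant points at its authorized name."""
    return {
        variant: authorized
        for authorized, variants in authority.items()
        for variant in variants
    }


def extract_countries(text, authority=AUTHORITY):
    """Return authorized country names in order of first appearance, without repeats."""
    index = build_variant_index(authority)
    # Try longer variants first so a long name is not shadowed by a shorter one.
    alternatives = "|".join(
        re.escape(v) for v in sorted(index, key=len, reverse=True)
    )
    found = []
    for match in re.finditer(r"\b(?:" + alternatives + r")\b", text.lower()):
        authorized = index[match.group(0)]
        if authorized not in found:
            found.append(authorized)
    return found
```

Storing only the authorized form in metadata means every system that reads the attribute sees the same value, whatever spelling the document used.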
17
Use of concept extraction + rules engine to capture Loan #, Credit #, Project ID#
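
Pattern rules of this kind can be approximated with regular expressions. The sketch below is an assumption-laden illustration: the identifier formats (a "Loan No." or "Credit No." followed by 3 to 5 digits, and a project ID of "P" plus six digits) are guesses chosen for the example, and the actual rules shown on the slide are not reproduced in the source.

```python
import re

# Hypothetical identifier formats, chosen only for illustration.
PATTERNS = {
    "loan_number": re.compile(
        r"\bLoan\s+(?:No\.|Number)\s*(\d{3,5})\b", re.IGNORECASE
    ),
    "credit_number": re.compile(
        r"\bCredit\s+(?:No\.|Number)\s*(\d{3,5})\b", re.IGNORECASE
    ),
    # Case-sensitive: an upper-case P followed by exactly six digits.
    "project_id": re.compile(r"\bP\d{6}\b"),
}


def extract_identifiers(text):
    """Return every match for each identifier type, keyed by type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}
```

Requiring the cue word ("Loan", "Credit") next to the number is what keeps arbitrary numbers in the text from being captured as identifiers.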
18
Processed Content
► Let’s look at some examples of content which has been programmatically processed
► Topic classification, geographical region assignment, keywording examples
► Can apply this approach to any kind of content
► Enables us to build a robust metadata repository model, with strong metadata quality, to move towards SI at the functional level
► Also note that we can do this across many languages
19
Impacts & Outcomes
► Information Access impacts
   - Increased precision of search
   - Better control over recall
   - Searching like we talk
   - Exact match searching: known item searching now a reality
   - Metadata based searching now begins to resemble full-text searching but with all the advantages of structure & context, and a significant reduction in the amount of noise
► Productivity Improvements
   - Can now assign deep metadata to all kinds of content
   - Remove the human review aspect from the metadata capture
   - Reduce unit times where human review is still used
► Information Quality impacts
   - All metadata carries the information architecture with it
   - Apply quality metrics at the metadata level to eliminate the need to build ‘fuzzy search architectures’, which rarely scale or improve in performance
   - Use the technologies to identify and fix problems with our data
20
Progress To Date
► Operational in two systems: document management and library of learning
► Retrospectively processed 60,000 documents in 30 hours last weekend, with dramatic improvement to access and quality and increased semantic interoperability potential
► Beginning the reprocessing of 3.7+ million documents in our records management system (expected 3 month duration): adding metadata to support search, enabling browse/search, and capturing metadata in the language of the document to support cross-language searching
► Reprocessing web content by adding deep metadata, following the records management system project
► System by system, implementing a rich enterprise profile, we achieve a very high degree of semantic interoperability
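
The figures above allow a quick back-of-the-envelope check. The sketch assumes the records management run proceeds at the same rate as the 60,000-document run and runs around the clock; neither assumption is stated in the slides.

```python
def throughput_per_hour(documents, hours):
    """Average number of documents processed per hour."""
    return documents / hours


def estimated_days(total_documents, documents_per_hour, hours_per_day=24.0):
    """Days needed at a constant rate, given how many hours per day the job runs."""
    return total_documents / documents_per_hour / hours_per_day


rate = throughput_per_hour(60_000, 30)   # 2,000 documents per hour
days = estimated_days(3_700_000, rate)   # 1,850 hours, about 77 days
print(f"{rate:.0f} documents/hour, about {days:.0f} days")
```

At roughly 77 days of continuous processing, the estimate is in line with the expected three-month duration once some downtime is allowed for.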