1
Achieving Semantic Interoperability – Architectures and Methods
Denise A. D. Bedford, Senior Information Officer, World Bank
2
Semantic Interoperability (SI)
Semantic interoperability means different things to different people, primarily because the context is always different.
Semantics
– Resolved at the understanding and reasoning level
– Word level, concept level, language level, grammatical level, domain vocabulary level, representation level
Interoperability
– Resolved at the architecture level
– Different sources using different semantics
3
What Does SI Look Like?
The answer to this question is always, "It depends…"
Achieving semantic interoperability means that the semantic and interoperability challenges are resolved at the system level, not at the user level.
Practical examples:
– Cross-application discovery
– Cross-language discovery
– Recommender engines
– Workflow management
– Scenario inferencing
Let's look at a high-level model of enterprise search and find the SI points.
4
Diagram: Vision of Semantic Interoperability. Source systems (TRIM Archives, PeopleSoft, SIRSI, InfoShop, SAP Financial System, IRIS, Web Content Management) feed metadata extracts, via transformation rules/maps and concept extraction, categorization, and summarization technologies, into a metadata repository of Bank-standard metadata (Oracle tables and indexes). The repository serves the World Bank Catalog/Enterprise Search (Oracle interMedia), site-specific searching, the Publications Catalog, recommender engines, personal profiles, portal content syndication, and browse and navigation structures, governed by reference tables (CDS+: topics, countries, document types as Oracle data classes) and data governance bodies.
5
Basic Assumptions and Constraints
There are many layers of semantic challenges between the user experience and the architecture.
Ideally, semantic interoperability is grounded in your enterprise architecture, regardless of its level of sophistication.
Semantic interoperability is a question of degree: some layers are interoperable at the enterprise level, others at a local level.
Some layers may be universal, reaching beyond the enterprise; others are by definition limited to the enterprise.
6
Managing Interoperability Challenges
Option 1: Integrate, map, and reconcile at a superficial level
– Reference mappings
– Continuous monitoring, always after the fact
– Consultation, reconciliation, and fixing
– The SI solution is always a partial solution
Option 2: Provide the capability to generate semantically interoperable solutions early in the development stages
– Use the technologies to model what people would do if they had unlimited time and resources
– Develop consistent profiles which are distributed throughout the enterprise but managed centrally
– Govern and manage the profiles, not the 'mess'
7
Combining Options
Option 1 is feeding the beast: you never get ahead, and it consumes resources you could use for other products and services.
My experience is that we have to use both options:
– Mapping and managing the legacy data, unless you can reconvert it
– Pushing a programmatic solution for new content
– At least trying to stop the reconciliation at a given point in time
I'd like to talk first about the idea behind the architecture and second about the actual semantic methods.
8
Teragram Tools
Teragram is a company located in Boston and Paris which offers COTS natural language processing (NLP) technologies.
Teragram's NLP technologies include:
– Rule-based concept extraction (also called the classifier)
– Grammar-based concept extraction
– Categorization
– Summarization
– Clustering
– Language detection
Semantic engines are available in 30+ languages.
9
Teragram Use
Operationalized in the system:
– IRIS – retrospective processing
– ImageBank – daily processing of incoming documents
– Structured service descriptions – terse text
Self-service model:
– WBI Library of Learning
– Africa Region Operations Toolkit
– External Affairs – eLibrary
– External Affairs – Media Monitoring
– External Affairs – Disease Control Priorities website
– ICSID – document management
– PICs MARC record attributes
– Web archives metadata
10
Structured & Unstructured Data
Range of formats processed:
– Anything in electronic format – MS Office, HTML, XML, PDF, …
Range of types of text processed:
– 17M PDF documents
– Very short structured service descriptions
Different writing styles:
– Formal publications, informal internal emails, web pages, data reports
Depending on what you are trying to do with the data, you may or may not have to adjust the profile and your strategy.
The most important consideration, though, is the nature of the writing style – informal text requires some adjustments.
11
Business Drivers
In order to get ahead of the problem, we decided to:
– 'Institutionalize' the Teragram profiles so that outputs are consistently generated across applications and content
– Have a single installation of the technologies to ensure consistent management and efficient maintenance
– Allow different systems to call and consume the outputs from the technologies while using the same profiles
– Avoid tight integration of the Teragram technologies with any existing system
12
Diagram: Teragram Components & Configuration. The Teragram team uses the TK240 client for enterprise profile development and maintenance, drawing on master data stores (authority lists, taxonomies, controlled vocabularies), training and testing sets, and language grammars to build the concept profile, categorization profile, concept list for clustering, and summarization rules file. These profiles drive the concept, categorization, summarization, and clustering engines, which return XML-formatted output.
13
Diagram: Enterprise Metadata Capture – Functional Reference Model. A dedicated Teragram semantic engine server (concept extraction, categorization, clustering, rule-based engine, language detection) sits behind APIs and integration layers for ImageBank, ISP, IRIS, and Factiva content capture, returning XML-wrapped metadata to each system's metadata database. Enterprise profile development and maintenance through the TK240 client draws on e-CDS reference sources and involves content owners, business analysts, IDU indexers, SITRC librarians, and the IRIS functional team.
14
Information Architecture Best Practices
Build profiles at the attribute level so that everyone can use the same profile and there is only one profile to maintain.
Each calling system, though, can specify the attributes that it wants to use in its processing (a minimal sketch follows this list):
– ImageBank can specify Topics and Keywords
– WBI can specify Topics, Keywords, Country, Regions
– Media Monitoring can specify Topics, Organization Names, People Names
– eLibrary can specify Author, Title, Publisher, Publication Date, Topics, Library of Congress Class No.
Each of these users is calling the same Topic profile even though their overall profiles are different.
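To make the idea concrete, here is a minimal sketch in Python – hypothetical application names, attribute keys, and a stand-in engine call, not the actual Teragram API – of how calling systems could request different attribute subsets while every system shares a single profile per attribute:

```python
# Hypothetical sketch: one shared profile per attribute, many calling systems.
SHARED_PROFILES = {
    "topics": "profiles/topics.profile",        # the single Topic profile
    "keywords": "profiles/keywords.profile",
    "country": "profiles/country.profile",
    "people_names": "profiles/people_names.profile",
}

APPLICATION_ATTRIBUTES = {
    "ImageBank": ["topics", "keywords"],
    "WBI": ["topics", "keywords", "country"],
    "MediaMonitoring": ["topics", "people_names"],
}

def run_engine(profile_path: str, text: str) -> list[str]:
    # Placeholder for the real semantic engine call.
    return []

def extract_metadata(app: str, document_text: str) -> dict:
    """Run only the attribute profiles this application subscribes to."""
    results = {}
    for attribute in APPLICATION_ATTRIBUTES[app]:
        profile = SHARED_PROFILES[attribute]  # same profile, whoever calls it
        results[attribute] = run_engine(profile, document_text)
    return results

print(extract_metadata("ImageBank", "…document text…"))
```

Centralizing the profiles this way is what lets one maintained Topic profile serve every application consistently.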
15
Enterprise Profile Development & Maintenance Enterprise Metadata Profile Concept Extraction Technology Country Organization Name People Name Series Name/Collection Title Author/Creator Title Publisher Standard Statistical Variable Version/Edition Categorization Technology Topic Categorization Business Function Categorization Region Categorization Sector Categorization Theme Categorization Rule-Based Capture Project ID Trust Fund # Loan # Credit # Series # Publication Date Language Summarization e-CDS Reference Sources for Country, Region, Topics Business Function, Keywords, Project ID, People, Organization Data Governance Process for Topics, Business Function, Country, Region, Keywords, People, Organizations, Project ID Teragram Team TK240 Client ISP IRISImageBank Factiva JOLIS E-Journals Enterprise Profile Creation and Maintenance UCM Service Requests Update & Change Requests
16
Now For the Semantics…
17
Context
Today I will use a simple application to illustrate the problems and the solutions.
The context is programmatic capture of high-quality, consistent, persistent, rich metadata to support parametric enterprise search.
Parametric enterprise search looks simple, but there are a lot of underlying semantic problems.
The implementation has expanded beyond core metadata at this point and continues to grow – including into other languages – but that's another discussion.
18
Cross Application Information Discovery
20
Diagram: World Bank Core Metadata. Attributes serve identification/distinction, use, management, and compliant document management, and are populated by human creation, programmatic capture, inheritance from system context, or extrapolation from business rules, in support of search and browse.
21
Semantic Methods
Each of these parameters presents a different kind of semantic challenge.
You need to find the right semantic solution to fit the semantic problem.
Semantic methods should always mirror how a human approaches, deconstructs, and solves the semantic challenge.
Purely statistical approaches to solving semantic problems are only appropriate where a human being would take a statistical approach.
The mistake we have made in the profession is to assume that statistical methods can solve semantic problems – they cannot.
22
NLP Technologies – Two Approaches
Over the past 50 years, there have been two competing strategies in NLP: statistical vs. semantic.
In the mid-1990s, at the AAAI Stanford Spring Workshops, the active practitioners agreed that the statistical NLP approach had hit a ceiling – there were no further productivity gains to be made from this approach.
About that time, the semantic approach showed practical gains; we have been combining the two approaches since the late 1990s.
Most of the tools on the market today are statistical NLP, but some have a more robust underlying semantic engine.
23
Problem with Statistical NLP
We experimented with several of these tools in the early 2000s – including Autonomy, Semio, and Northern Lights Clustering – but there were problems:
– The statistical associations you generate are entirely dependent on the frequency at which they occur in the training set
– Without a semantic base you cannot distinguish types of entities, attributes, concepts, or relationships
– If the training set is not representative of your universe, your relationships will not be representative and you cannot generalize from the results
– If the universe crosses domains, then the data that have the greatest commonality (least meaning) have the greatest association value
24
Semantic NLP
For years, people thought the semantic approach could not be achieved, so they relied on statistical methods.
The reason they thought it would never be practical is that it takes a long time to build the foundation – understanding human language is not a trivial exercise.
Building a semantic foundation involves:
– Developing grammatical and morphological rules, language by language
– Using parsers and part-of-speech (POS) taggers to decompose text into semantic elements
– Building dictionaries or corpora for individual languages as fuel for the semantic foundation to run on
– Making it all work fast enough, and in a resource-efficient way, to be economically practical
25
Example of Semantic Analysis
26
Problem with Statistical Tools
There are problems with the way the statistical techniques are packaged in commercial tools:
– Resource-intensive to run – clustering 100 documents may take several hours and give you suboptimal results
– Results are dynamic, not persistent – you can't do anything with the results except look at them and point back to the documents
– They live only in the index that was built to support the cluster and generally are not consumable by other tools
– Outputs are not persistently associated with the content
– We wanted to generate persistent metadata which could then be manipulated by other tools
27
Implementing Teragram
The package consists of a developer's client (TK240) and multiple servers to support the technologies.
The client is the tool we use to build the profiles/rules; the server interprets the rules.
Recall the earlier model of enterprise profiles.
Each attribute is supported by its own profile – there is a profile for countries, one for regions, one for topics, one for people names, and so on.
We keep a 'table' of the profiles that any application uses, and call the profiles at run time.
Language profiles are separate – English, French, Spanish, …
28
Implementing Teragram
The first step is not applying the tool to content, but analyzing the semantic challenge.
Understand how a person resolves the semantic problem, then devise a machine solution that resembles the human solution.
The solution involves selecting a tool from the Teragram set, building the rules, testing and refining the rules, then rolling out as QA for end-user review.
End-user feedback and signoff are important – they build confidence and improve the quality of the result.
Depending on the complexity of the problem and whether the rules require a reference source, putting the solution together might take from a week to two months.
29
Examples of Solutions
There are different kinds of semantic tools – you have to find the one that suits your semantic problem.
Let's look at some solution examples:
– Rule-based concept extraction
– Grammar-based concept extraction
– Categorization
– Summarization
– Clustering
– Language detection
As I talk about each solution, I'll describe what we tried that didn't work, as well as what did work in the end.
30
Rule-Based Concept Extraction
What is it?
– Rule-based concept or entity extraction is a simple pattern-recognition technique which looks for and extracts named entities
– Entities can be anything, but you have to have a comprehensive list of the names of the entities you're looking for
How does it work?
– It is a simple pattern-matching program which compares the list of entity names to what it finds in content
– Regular expressions are used to match sets of strings that follow a pattern but contain some variation
– The list of entity names can be built from scratch or from existing sources – we try to use existing sources
– A rule-based concept extractor would be fueled by a list such as working paper series names, edition or version statements, publishers' names, etc.
– Generally, concept extraction works on a "match" or "no match" basis – it matches or it doesn't
– Your list of entity names has to be pretty good
31
Rule-Based Concept Extraction
How do we build it?
1. Create a comprehensive list of the names of the entities – most of the time these already exist, and there may be multiple copies
2. Review the list, study the patterns in the names, and prune the list
3. Apply regular expressions to simplify the patterns in the names
4. Build a concept profile
5. Run the concept profile against a test set of documents (not a training set, because we build this from an authoritative list, not through 'discovery')
6. Review the results and refine the profile
A minimal sketch of the match/no-match approach follows below.
State of the industry
– The industry is very advanced – this type of work has been under development and deployed for at least three decades now. It is a bit more reliable than grammatical extraction, but it takes more time to build.
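As an illustration only – not the Teragram implementation – here is a minimal Python sketch of list-based, match/no-match entity extraction, using an invented entity list:

```python
import re

# Hypothetical entity list; in practice this comes from an authoritative
# source such as a series name list or a publisher name file.
ENTITY_NAMES = [
    "World Bank Policy Research Working Paper",
    "International Finance Corporation",
    "Oxford University Press",
]

# One alternation pattern; word boundaries keep matches exact.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in ENTITY_NAMES) + r")\b"
)

def extract_entities(text: str) -> list[str]:
    """Return the entity names found in the text (match / no match)."""
    return sorted(set(PATTERN.findall(text)))

sample = ("This study, issued as a World Bank Policy Research Working Paper, "
          "was published by Oxford University Press.")
print(extract_entities(sample))
# ['Oxford University Press', 'World Bank Policy Research Working Paper']
```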
32
Rule-Based Concept Extraction Examples
– Loan #
– Credit #
– Report #
– Trust Fund #
– ISBN, ISSN
– Organization name (companies, NGOs, IGOs, governmental organizations, etc.)
– Address
– Phone numbers
– Social Security numbers
– Library of Congress class number
– Digital Object Identifier
– URLs
– ICSID tribunal number
– Edition or version statement
– Series name
– Publisher name
Let's look at the Teragram TK240 profiles for organization names, edition statements, and ISBN.
33
ISBN Concept Extraction Profile – Regular Expressions (RegEx)
The concept-based rules engine allows us to define patterns to capture other kinds of data.
This profile uses concept extraction, regular expressions, and the rules engine to capture ISBNs. Regular expressions match sets of strings by pattern, so we don't need to list every exact ISBN we're looking for.
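A hedged illustration of the same idea in ordinary Python regex (not the TK240 rule syntax): a single pattern captures well-formed ISBN-10 and ISBN-13 strings without enumerating them.

```python
import re

# Loose ISBN pattern for illustration: matches ISBN-10 and ISBN-13 forms
# with optional hyphens/spaces, e.g. "ISBN 978-0-8213-5677-7".
# A production profile would also validate the check digit.
ISBN_PATTERN = re.compile(
    r"ISBN(?:-1[03])?:?\s*((?:97[89][- ]?)?\d{1,5}[- ]?\d+[- ]?\d+[- ]?[\dX])",
    re.IGNORECASE,
)

text = "Cite as ISBN 978-0-8213-5677-7; the first edition was ISBN 0-8213-5677-0."
print(ISBN_PATTERN.findall(text))
# ['978-0-8213-5677-7', '0-8213-5677-0']
```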
34
Classifier concept extraction allows us to look for exact string matches.
The list of entities matches exact strings. This requires an exhaustive list, but it gives us extensive control. (It would be difficult to distinguish by pattern between IGOs and other NGOs.)
35
Another list of entities matches exact strings. In this case, though, we're building an 'authority control list': we're matching multiple variant strings to the one approved output (in this case, the AACR2-approved edition statement).
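A minimal sketch of the authority-control idea, with invented variant strings – many surface forms map to one approved value:

```python
# Hypothetical variants mapped to one approved (AACR2-style) output.
EDITION_AUTHORITY = {
    "first edition": "1st ed.",
    "1st edition": "1st ed.",
    "second edition": "2nd ed.",
    "2nd edition": "2nd ed.",
    "revised edition": "Rev. ed.",
}

def normalize_edition(raw: str) -> str | None:
    """Return the approved edition statement, or None if no match."""
    return EDITION_AUTHORITY.get(raw.strip().lower())

print(normalize_edition("Second Edition"))  # '2nd ed.'
```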
36
Grammatical Concept Extraction
What is it?
– A simple pattern-matching algorithm which matches your specifications to the underlying grammatical entities
– For example, you could define a grammar that describes a proper noun for people's names, or one for sentence fragments that look like titles
How does it work?
– This is also a pattern-matching program, but it uses computational-linguistic knowledge of a language to identify the entities to extract – if you don't have an underlying semantic engine, you can't do this type of extraction
– There is no authoritative list in this case; instead it uses parsers, part-of-speech tagging, and grammatical code
– The semantic engine's dictionary determines how well the extraction works – if you don't have a good dictionary you won't get good results
– There needs to be a distinct semantic engine for each language you're working with
37
Grammatical Concept Extraction
How do we build it?
– Model the type of grammatical entity we want to extract and use the grammar definitions to build a profile
– Test the profile on a set of test content to see how it behaves
– Refine the grammars
– Deploy the profile
State of the industry
– It has taken decades to get the grammars for languages well defined
– There are not many of these tools on the market today, though we are pushing for more open source options
– Teragram now has grammars and semantic engines for 30 different languages commercially available
– IFC has been working with ClearForest
Let's look at some examples of grammatical profiles – people's names, noun phrases, verb phrases, book titles. A rough illustration with an off-the-shelf POS tagger follows below.
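This is not the Teragram grammar formalism, but the underlying idea can be shown with an off-the-shelf part-of-speech tagger: tag the tokens, then collect runs of proper nouns as candidate people or organization names. A rough sketch using NLTK (assuming its tokenizer and tagger models are installed):

```python
import nltk

# One-time downloads, if needed:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def extract_proper_noun_runs(text: str) -> list[str]:
    """Collect consecutive proper-noun tokens (NNP/NNPS) as candidate names."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    runs, current = [], []
    for word, tag in tagged:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs

text = "Aruna Roy and Bharat Dogra spoke at the World Bank office in Delhi."
print(extract_proper_noun_runs(text))
# e.g. ['Aruna Roy', 'Bharat Dogra', 'World Bank', 'Delhi']
```

Note how a dictionary-driven tagger, not an entity list, is doing the work here – which is why the quality of the underlying dictionary matters so much.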
38
TK240 Grammars for People Names
Grammar concept extraction allows us to define concepts based on semantic language patterns.
39
Grammatical Concept Extraction
Sample output from the Proper Noun Profile for people names, which uses grammars to find and extract the names of people referenced in a document.
File: W:/Concept Extraction/Media Monitoring Negative Training Set/001B950F2EE8D0B4452570B4003FF816.txt
PEOPLE_ORG (7): Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri
40
Grammatical Concept Extraction – People Names Client testing mode
41
Rule-Based Categorization
What is it?
– Categorization is the process of grouping things based on characteristics
– Categorization technologies classify documents into groups or collections of resources
– An object is assigned to a category or schema class because it is 'like' the other resources in some way
– Categories form part of a hierarchical structure when applied to a scheme such as a taxonomy
How does it work?
– Automated categorization is an 'inferencing' task, meaning that we have to tell the tools what makes up a category and then how to decide whether something fits that category or not
– We have to teach it to think like a human being: when I see "access to phone lines, analog cellular systems, answer bid rate, answer seizure rate," I know this should be categorized as 'telecommunications'
– We use domain vocabularies to create the category descriptions
42
Rule-Based Categorization
How do we build it?
1. Build the hierarchy of categories
   a) Manually, if you have a scheme in place and maintained by people
   b) Programmatically, if you need to discover what the scheme should be
2. Build a training set of content, category by category, from all kinds of content
3. Describe each category in terms of its 'ontology' – in our case this means the concepts that describe it (generally between 1,000 and 10,000 concepts)
4. Filter the list to discover groups of concepts
5. The richer the definition, the better the categorization engine works
6. Test each category profile on the training set
7. Test the category profile on a larger set that is outside the domain
8. Insert the category profile into the profile for the larger hierarchy
A schematic of concept-based category scoring follows below.
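As a hedged schematic – invented category vocabularies, nowhere near the size of the real profiles – each category is defined by a concept list, and a document is assigned the categories whose concepts it matches above a threshold:

```python
# Hypothetical concept 'ontologies'; real profiles run to thousands of concepts.
CATEGORY_CONCEPTS = {
    "telecommunications": {"access to phone lines", "analog cellular systems",
                           "answer bid rate", "answer seizure rate"},
    "wildlife resources": {"habitat conservation", "endangered species",
                           "poaching", "protected areas"},
}

def categorize(text: str, threshold: int = 2) -> list[str]:
    """Assign every category with at least `threshold` concept hits."""
    lowered = text.lower()
    assigned = []
    for category, concepts in CATEGORY_CONCEPTS.items():
        hits = sum(1 for concept in concepts if concept in lowered)
        if hits >= threshold:
            assigned.append(category)
    return assigned

doc = ("The report reviews access to phone lines and the answer seizure rate "
       "across analog cellular systems in the region.")
print(categorize(doc))  # ['telecommunications']
```

The richer the concept list per category, the better this kind of engine discriminates – which is the point of step 3 above.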
43
Rule-Based Categorization
State of the industry
– Only a handful of rule-based categorizers are on the market today
– Most of the existing technologies are dynamic clustering tools
– However, the market will probably grow in this area as demand grows
44
Categorization Examples
Let's look at some working examples in the Teragram TK240 profiles:
– Topics
– Countries
– Regions
– Sector
– Theme
– Disease profiles
Other categorization profiles we're also working on:
– Business processes (characteristics of business processes)
– Sentiment ratings (positive media statements, negative media statements, etc.)
– Document types (by characteristics found in the documents)
– Security classification (by characteristics found in the documents)
45
Topic Hierarchy
From relationships across data classes. Build the rules at the lowest level of categorization.
46
Subtopics
Domain concepts or controlled vocabulary
47
Topics Categorization Client Test
48
Automatically Generated XML Metadata
49
Automatically Generated Metadata
50
Automatically Generated XML Metadata for the Business Function Attribute
Example: office memorandum requesting the CD's clearance of the Board package for NEPAL: Economic Reforms Technical Assistance (ERTA).
51
Clustering
What is it?
– The use of statistical and data-mining techniques to partition data into sets. Generally the partitioning is based on statistical co-occurrence of words and their proximity to or distance from each other
How does it work?
– Words that occur frequently close to one another are assigned to the same cluster
– Clusters can be defined at the set or the concept level – usually the latter
– It can work with a raw training set of text to discover and associate concepts or to suggest 'buckets' of concepts
– A few tools can work with a refined list of concepts to be clustered against a text corpus
– Please note the difference between clustering words in content and clustering domain concepts – a major distinction
A toy co-occurrence sketch follows below.
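To make the co-occurrence idea concrete, here is a toy sketch – invented documents and concepts, nothing like production scale – in which concepts that appear in the same documents are grouped together:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical 'feeder' concept list and tiny training set.
CONCEPTS = ["wetlands", "migratory birds", "poaching", "park rangers"]
DOCUMENTS = [
    "Wetlands loss threatens migratory birds along the flyway.",
    "Park rangers report a decline in poaching near the reserve.",
    "Poaching patrols by park rangers protect migratory birds.",
]

# Count how often each pair of concepts co-occurs in the same document.
cooccurrence = defaultdict(int)
for doc in DOCUMENTS:
    lowered = doc.lower()
    present = [c for c in CONCEPTS if c in lowered]
    for a, b in combinations(sorted(present), 2):
        cooccurrence[(a, b)] += 1

# Pairs seen together form the (transitory) clusters.
for pair, count in sorted(cooccurrence.items(), key=lambda kv: -kv[1]):
    print(pair, count)
# e.g. (('park rangers', 'poaching'), 2) ranks first
```

Note that the output depends entirely on the training set: change the documents and the clusters change, which is exactly the transience described below.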
52
Clustering vs. Categorization
(Side-by-side comparison of clustering output and categorization output.)
53
Feeder Clustering
How do we build it?
1. Define the list of concepts
2. Create the training set
3. Load the concepts into the clustering engine
4. Generate the concept clusters
State of the industry
– Most of the commercial tools that call themselves 'categorizers' are actually clustering engines
– Clustering generally doesn't work at a high domain level for large sets of text
– It can provide insights into the concepts in a domain when used on a small set of documents
– All the engines are resource-intensive, though, and the outputs are transitory – clusters live only in the cluster index
– If you change the text set, the clusters change
54
Clustering Concepts
This is from the clustering output for 12.15.00 – Wildlife Resources. 'Clusters' of concepts between line breaks are terms from the Wildlife Resources controlled vocabulary found co-occurring in the same training document. This highlights often subtle relationships.
55
Clustering Words in Content
Clusters of words based on occurrences in the content.
56
Summarization
What is it?
– Rule-driven pattern-matching and sentence-extraction programs
– It is important to distinguish summarization technologies from some information extraction technologies – many on the market extract 'fragments' of sentences, which is what Google does when it presents a search result to you
– Will generate document surrogates, point-of-view summaries, HTML meta tag descriptions, and a 'gist' or 'synopsis' for search indexing
– Results are sufficient for 'gisting' for HTML meta tags, as surrogates for full-text document indexing, or as summaries displayed in search results to give the user a sense of the content
How does it work?
– Uses rules and conditions for selecting sentences
– Enables us to define how many sentences to select
– Allows us to specify the concepts to use to select sentences
– Allows us to determine where in the sentence the concepts might occur
– Allows us to exclude sentences from being selected
– We can write multiple sets of rules for different kinds of content
57
Summarization
How do we build it?
1. Analyze the content to be summarized to understand the type of speech and writing used – IRIS is different from Publications, which is different from news stories
2. Identify the key concepts that should trigger a sentence extraction
3. Identify where in the sentence these concepts are likely to occur
4. Identify the concepts that should be avoided
5. Convert concepts and conditions to a rule format
6. Load the rule file onto the summarization server
7. Test the rules against a test set of content and refine until 'done'
8. Launch the summarization engine and call the rule file
State of the industry
– Most tools are either readers or extractors. The reader method uses clustering and weighting to promote sentence fragments. The extractor method uses an internal format representation with word and sentence weighting
– What has been missing from the extractors in most commercial products is the capability to specify the concepts and the rules. Teragram is the only product we found that supports this.
58
Summarization Rules

Code | Position in sentence | Effect | Example syntax
5 | anywhere in the sentence | likely not to be included | copyright/2004,5
9 | anywhere in the sentence | definitely not included | for/example,9
7 | anywhere in the sentence | definitely included | got/the/top/grade,7
10 | anywhere in the sentence | likely to be included | pull/off/that/coup,10
2 | anywhere in the sentence, followed by the second term | likely to be included | evidence,2:collected
1 | beginning of the sentence | likely to be included | we/report,1
6 | beginning of the sentence | definitely included | reporting/on,6
8 | beginning of the sentence | definitely not included | copyright/reserved,8
3 | beginning of the sentence; only if the preceding sentence qualifies | likely to be included | however,3
4 | beginning of the sentence; only if the preceding sentence qualifies | definitely included | the/former,4

A simplified interpreter for a few of these codes follows below.
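As a hedged illustration of how such rule codes could drive sentence selection – a simplified interpreter for codes 1, 5, 7, and 9 only; the real Teragram rule syntax and semantics may differ:

```python
import re

# Rules as (phrase, code) pairs, echoing the table's "phrase,code" syntax.
# Simplified semantics: 1 = boost if at sentence start, 5 = penalize,
# 7 = always include, 9 = always exclude.
RULES = [
    ("we report", 1),
    ("got the top grade", 7),
    ("copyright 2004", 5),
    ("for example", 9),
]

def summarize(text: str, max_sentences: int = 2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = []
    for sentence in sentences:
        low = sentence.lower()
        score = 0.0
        for phrase, code in RULES:
            if phrase not in low:
                continue
            if code == 9:                  # definitely exclude
                score = float("-inf")
                break
            if code == 7:                  # definitely include
                score = float("inf")
            elif code == 1 and low.startswith(phrase):
                score += 1                 # likely include at sentence start
            elif code == 5:
                score -= 1                 # likely exclude
        scored.append((score, sentence))
    ranked = sorted(scored, key=lambda t: -t[0])
    return [s for score, s in ranked if score > float("-inf")][:max_sentences]

doc = ("We report strong results this quarter. For example, margins grew. "
       "The team got the top grade from reviewers. Copyright 2004 Example Corp.")
print(summarize(doc))
# ['The team got the top grade from reviewers.', 'We report strong results this quarter.']
```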
59
Automatically Generated Gist
PID – Bosnia-Herzegovina Private Sector Credit Project
Rules:
– agreed/to,10
– with/the/objective,10
– objective,2:project
– proposed,2:project
– assist/in,10
(The screenshot shows the resulting gist.)
60
Impacts & Outcomes
Productivity improvements
– Can now assign deep metadata to all kinds of content
– Remove the human review step from metadata capture
– Reduce unit times where human review is still used
Information quality impacts
– The metadata created is consistent
– All metadata carries the information architecture with it
– Applying quality metrics at the metadata level eliminates the need to build 'fuzzy search architectures,' which rarely scale or improve in performance
– Use the technologies to identify and fix problems with our data
61
Lessons Learned
All semantic interoperability challenges are practical, which means there is a context in which they arise.
Don't try to solve semantic challenges that don't pertain to your environment – think long term about use.
Analyze the context to determine the highest-value semantic challenges.
Leverage what others have done, but don't adopt their SI solutions as black-box solutions – they won't work unless you have identical contexts.
Start by modeling the context – you might begin with a logical reference model or an ontology.
62
Additional Applications
– 60 years of content which is not characterized in terms of its business process – retrospectively categorize it to provide an important perspective
– People and institutions referenced
– Media monitoring – generating metadata for news stories from around the world for statistical analysis purposes: how is the Bank perceived in Brazil, in Kenya, in India?
– Capturing important numbers – bid #, project ID, Trust Fund # – where staff don't input them or make errors in transcription
– Language detection for content
63
Thank You! Questions & Discussions