Semantic Infrastructure Workshop
Development
Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services

Agenda
Text Analytics – Foundation
  – Features and Capabilities
Evaluation of Text Analytics
  – Start with Self-Knowledge
  – Features and Capabilities
  – Filter, Proof of Concept / Pilot
Text Analytics Development
  – Progressive Refinement
  – Categorization, Extraction, Sentiment
  – Case Studies
  – Best Practices

Semantic Infrastructure - Foundation
Text Analytics Features
Noun Phrase Extraction
  – Catalogs with variants, rule based dynamic
  – Multiple types, custom classes – entities, concepts, events
  – Feeds facets
Summarization
  – Customizable rules, map to different content
Fact Extraction
  – Relationships of entities – people-organizations-activities
  – Ontologies – triples, RDF, etc.
Sentiment Analysis
  – Rules – Objects and phrases – positive and negative

Semantic Infrastructure - Foundation
Text Analytics Features
Auto-categorization
  – Training sets – Bayesian, Vector space
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (Title, body, url)
  – Semantic Network – Predefined relationships, sets of rules
  – Boolean – Full search syntax – AND, OR, NOT
  – Advanced – NEAR (#), PARAGRAPH, SENTENCE
This is the most difficult to develop
Build on a Taxonomy
Combine with Extraction
  – If any of list of entities and other words

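To make the Boolean and NEAR operators above concrete, here is a minimal Python sketch of a single categorization rule. It is an illustration only, not any vendor's rule syntax: the function names, the "billing" category, and the term list are all hypothetical, and NEAR is approximated as "within N tokens".

```python
import re

def tokenize(text):
    # Lowercase and split into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

def near(tokens, a, b, distance):
    # True if terms a and b occur within `distance` tokens of each other.
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def billing_rule(text):
    # Hypothetical rule: (bill OR invoice) AND NEAR(late, fee, 3) AND NOT refund
    tokens = tokenize(text)
    words = set(tokens)
    return (
        ("bill" in words or "invoice" in words)
        and near(tokens, "late", "fee", 3)
        and "refund" not in words
    )

print(billing_rule("My bill shows a late payment fee this month"))
print(billing_rule("Please refund the late fee on my invoice"))
```

The first call matches the rule; the second is excluded by the NOT clause. Real products add stemming, PARAGRAPH and SENTENCE scoping, and relevancy scores on top of this kind of logic.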
Semantic Infrastructure - Foundation
Vendors of Taxonomy/ Text Analytics Software
  – Attensity
  – SAP - Business Objects – Inxight
  – Clarabridge
  – ClearForest
  – Concept Searching
  – Data Harmony / Access Innovations
  – Expert Systems
  – GATE (Open Source)
  – IBM Content Analyst
  – Lexalytics
  – Multi-Tes
  – Nstein
  – SAS - Teragram
  – SchemaLogic
  – Smart Logic
  – Synaptica
  – Ontology Vendors

Semantic Infrastructure - Foundation
Varieties of Taxonomy/ Text Analytics Software
Taxonomy Management
  – Synaptica, SchemaLogic
Full Platform
  – SAP-Inxight, ClearForest, SAS-Teragram, Data Harmony, Concept Searching, IBM
Content Management
  – Nstein, Interwoven, Documentum, etc.
Embedded – Search
  – FAST, Autonomy, Endeca, Exalead, etc.
Specialty
  – Sentiment Analysis - Lexalytics

Evaluating Taxonomy/Text Analytics Software
Start with Self Knowledge
Strategic and Business Context
Info Problems – what, how severe
Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it
Formal Process - KA audit – content, users, technology, business and information behaviors, applications
  – Or informal for a smaller organization
Text Analytics Strategy/Model – forms, technology, people
  – Existing taxonomic resources, software
Need this foundation to evaluate and to develop

Evaluating Taxonomy/Text Analytics Software
Start with Self Knowledge
Do you need it – and what blend if so?
Taxonomy Management Stand alone
  – Multiple taxonomies, languages, authors-editors
Technology Environment – ECM, Enterprise Search – where is it embedded
Publishing Process – where and how is metadata being added – now and projected future
  – Can it utilize auto-categorization, entity extraction, summarization
Is the current search adequate – can it utilize text analytics?
Applications – text mining, BI, CI, Alerts?

Semantic Infrastructure - Foundation
Design of the Text Analytics Selection Team
Interdisciplinary Team, led by Information Professionals
  – IT – software experience, budget, support tests
  – Business – understand business and requirements
  – Library – understand information structure, understanding of search semantics and functionality
Much more likely to make a good decision
  – This is not a traditional IT software evaluation – semantics
Create the foundation for implementation

Semantic Infrastructure - Foundation
Evaluating Text Analytics Software – Process
Start with Self Knowledge
Eliminate the unfit
  – Filter One - Ask Experts - reputation, research – Gartner, etc.
      Market strength of vendor, platforms, etc.
      Feature scorecard – minimum, must have, filter to top 3-4
  – Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus
  – Filter Three – Focus Group one day visit – 3-4 vendors
Deep pilot (2) / POC – advanced, integration, semantics
  – Two Questions – who is better, can it be done, for how much
Focus on working relationship with vendor.

Semantic Infrastructure - Foundation
Evaluating Taxonomy Software - POC
Quality of results is the essential factor
6 weeks POC – bake off / or short pilot
Real life scenarios, categorization with your content
Preparation:
  – Preliminary analysis of content and users' information needs
  – Set up software in lab – relatively easy
  – Train taxonomist(s) on software(s)
  – Develop taxonomy if none available
Six week POC – 3 rounds of development, test, refine / Not OOB
Need SMEs as test evaluators – also to do an initial categorization of content

Semantic Infrastructure - Foundation
Evaluating Taxonomy Software - POC
Scenarios – categorization, extraction, summarization, etc.
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time
Elements:
  – Content
  – Search terms / search scenarios
  – Training and test sets
Taxonomy Developers – expert consultants plus internal taxonomists
Evaluate usability in action by taxonomists

Semantic Infrastructure - Foundation
Evaluating Taxonomy Software – POC Issues
Quality of content
Quality of initial human categorization
Normalize among different test evaluators
Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors
Quality of taxonomy
  – General issues – structure (too flat or too deep)
  – Overlapping categories
  – Differences in use – browse, index, categorize
Categorization essential issue is complexity of language
Entity Extraction essential issue is scale, disambiguation

Semantic Infrastructure - Foundation
Case Study: Telecom Service Company
History, Reputation
Full Platform – Categorization, Extraction, Sentiment
Integration – Java, API-SDK, Linux
Multiple languages
Scale – millions of docs a day
Total Cost of Ownership
Ease of Development - new
Vendor Relationship – OEM, etc.

Expert Systems
IBM
SAS - Teragram
Smart Logic
Option – Multiple vendors – Sentiment & Platform
IBM and SAS – finalists

Semantic Infrastructure - Foundation
POC Design Discussion: Evaluation Criteria
Basic Test Design – categorize test set
  – Score – by file name, human testers
Categorization – Call Motivation
  – Accuracy Level – 80-90%
  – Effort Level per accuracy level
Sentiment Analysis
  – Accuracy Level – 80-90%
  – Effort Level per accuracy level
Quantify development time – main elements
Comparison of two vendors – how score?
  – Combination of scores and report

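Scoring a categorized test set by file name against human testers comes down to precision and recall per category. The following Python sketch shows one way to compute them, assuming each file maps to a set of category labels. The file names and the "billing" and "cancel" labels are hypothetical and are not POC data.

```python
def precision_recall(human, predicted, category):
    # human, predicted: file name -> set of category labels
    files = set(human) | set(predicted)
    tp = fp = fn = 0
    for name in files:
        expected = category in human.get(name, set())
        assigned = category in predicted.get(name, set())
        if assigned and expected:
            tp += 1          # software and human testers agree
        elif assigned:
            fp += 1          # software assigned it, humans did not
        elif expected:
            fn += 1          # humans assigned it, software missed it
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical test set, for illustration only.
human = {
    "call01.txt": {"billing"},
    "call02.txt": {"billing"},
    "call03.txt": {"cancel"},
    "call04.txt": {"billing", "cancel"},
}
predicted = {
    "call01.txt": {"billing"},
    "call02.txt": set(),
    "call03.txt": {"billing"},
    "call04.txt": {"billing"},
}
p, r = precision_recall(human, predicted, "billing")
print(f"billing precision={p:.2f} recall={r:.2f}")
```

Running the same scoring after each round of rule refinement gives the "effort level per accuracy level" comparison: how much development time each vendor's tool needs to reach the target range.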
Text Analytics POC Outcomes
Categorization Results
                     SAS    IBM
Recall-Motivation
Recall-Actions
Precision-Mot.       84.3
Precision-Act        100
Uncategorized        87.5
Raw Precision

Text Analytics POC Outcomes
Vendor Comparisons
Categorization Results – both good, edge to SAS on precision
  – Use of Relevancy to set thresholds
Development Environment
  – IBM as toolkit provides more flexibility but it also increases development effort
Methodology – IBM enforces good method, but takes more time
  – SAS can be used in exactly the same way
SAS has a much more complete set of operators – NOT, DIST, START

Text Analytics POC Outcomes
Vendor Comparisons - Functionality
Sentiment Analysis – SAS has workbench, IBM would require more development
  – SAS also has statistical modeling capabilities
Entity and Fact extraction – seems basically the same
  – SAS can use operators for improved disambiguation
Summarization – SAS has built-in
  – IBM could develop using categorization rules – but not clear that would be as effective without operators
Conclusion: Both can do the job, edge to SAS
Now the fun begins - development

Text Analytics Development: Foundation
Articulated Information Management Strategy (K Map)
  – Content and Structures and Metadata
  – Search, ECM, applications - and how used in Enterprise
  – Community information needs and Text Analytics Team
POC establishes the preliminary foundation
  – Need to expand and deepen
  – Content – full range, basis for rules-training
  – Additional SMEs – content selection, refinement
Taxonomy – starting point for categorization / suitable?
Databases – starting point for entity catalogs

Text Analytics Development
Enterprise Environment – Case Studies
A Tale of Two Taxonomies
  – It was the best of times, it was the worst of times
Basic Approach
  – Initial meetings – project planning
  – High level K map – content, people, technology
  – Contextual and Information Interviews
  – Content Analysis
  – Draft Taxonomy – validation interviews, refine
  – Integration and Governance Plans

Text Analytics Development
Enterprise Environment – Case One – Taxonomy, 7 facets
Taxonomy of Subjects / Disciplines:
  – Science > Marine Science > Marine microbiology > Marine toxins
Facets:
  – Organization > Division > Group
  – Clients > Federal > EPA
  – Instruments > Environmental Testing > Ocean Analysis > Vehicle
  – Facilities > Division > Location > Building X
  – Methods > Social > Population Study
  – Materials > Compounds > Chemicals
  – Content Type – Knowledge Asset > Proposals

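The "A > B > C" facet paths above map naturally onto tuples, which makes faceted filtering a prefix test. The Python sketch below assumes that representation; the document names and their tags are hypothetical, while the facet values come from the slide.

```python
def parse_path(path):
    # "Clients > Federal > EPA" -> ("Clients", "Federal", "EPA")
    return tuple(part.strip() for part in path.split(">"))

def is_under(path, prefix):
    # True if `path` equals `prefix` or sits below it in the hierarchy.
    return path[:len(prefix)] == prefix

def select(documents, prefix):
    # documents: name -> list of facet paths tagged on that document
    return sorted(
        name
        for name, paths in documents.items()
        if any(is_under(p, prefix) for p in paths)
    )

# Hypothetical documents tagged with facet values from the slide.
documents = {
    "proposal-001": [
        parse_path("Clients > Federal > EPA"),
        parse_path("Methods > Social > Population Study"),
    ],
    "report-002": [
        parse_path("Materials > Compounds > Chemicals"),
    ],
}
print(select(documents, parse_path("Clients > Federal")))
```

Selecting on a higher-level prefix such as `Clients > Federal` returns everything tagged beneath it, which is the behavior a browse or faceted-search interface typically needs.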
Text Analytics Development
Enterprise Environment – Case One – Taxonomy, 7 facets
Project Owner – KM department – included RM, business process
Involvement of library - critical
Realistic budget, flexible project plan
Successful interviews – build on context
  – Overall information strategy – where taxonomy fits
Good Draft taxonomy and extended refinement
  – Software, process, team – train library staff
  – Good selection and number of facets
Final plans and hand off to client

Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 facets
Taxonomy of Subjects / Disciplines:
  – Geology > Petrology
Facets:
  – Organization > Division > Group
  – Process > Drill a Well > File Test Plan
  – Assets > Platforms > Platform A
  – Content Type > Communication > Presentations

Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 facets
Environment Issues
  – Value of taxonomy understood, but not the complexity and scope
  – Under budget, under staffed
  – Location – not KM – tied to RM and software
      Solution looking for the right problem
  – Importance of an internal library staff
  – Difficulty of merging internal expertise and taxonomy

Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 facets
Project Issues
  – Project mind set – not infrastructure
  – Wrong kind of project management
      Special needs of a taxonomy project
      Importance of integration – with team, company
  – Project plan more important than results
      Rushing to meet deadlines doesn’t work with semantics as well as software

Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 facets
Research Issues
  – Not enough research – and wrong people
  – Interference of non-taxonomy – communication
  – Misunderstanding of research – wanted tinker toy connections
      Interview 1 implies conclusion A
Design Issues
  – Not enough facets
  – Wrong set of facets – business not information
  – Ill-defined facets – too complex internal structure

Text Analytics Development
Conclusion: Risk Factors
Political-Cultural-Semantic Environment
  – Not simple resistance - more subtle
  – Re-interpretation of specific conclusions and sequence of conclusions / Relative importance of specific recommendations
Understanding project scope
Access to content and people
  – Enthusiastic access
Importance of a unified project team
  – Working communication as well as weekly meetings

Text Analytics Development
Case Study 2 – POC – Telecom Client
Demo of SAS – Teragram / Enterprise Content Categorization

Text Analytics Development
Best Practices - Principles
Importance of ongoing maintenance and refinement
Need dedicated taxonomy team working with SMEs
Work with application developers to incorporate text analytics into new applications
Importance of metrics and feedback – Software and social
Questions:
  – What are important subjects (and changes)?
  – What information do they need?
  – How is their information related to other silos?

Text Analytics Development
Best Practices - Principles
Process
  – Realistic Budget – not a nice to have add on
  – Flexible Project plan - semantics are complex and messy
      Time estimates are difficult, objective success measures are too
  – Transition from development to maintenance is fluid
Resources
  – Interdisciplinary Team is essential
  – Importance of communication – languages
  – Merging internal and external expertise

Text Analytics Development
Best Practices - Principles
Categorization taxonomy structure
  – Tradeoff of depth and complexity of rules
  – Multiple avenues – facets, terms, rules, etc.
      No right balance
  – Recall-precision balance is application specific
  – Training sets are starting points, rules rule
  – Need for custom development
Technology
  – Basic integration – XML
  – Advanced – combine unstructured and structured in new ways

Text Analytics Development
Best Practices – Risk Factors
Value understood, but not the complexity and scope
Project mindset – software project and then done
Not enough research on user information needs, behaviors
  – Talking to the right people and asking the right questions
  – Getting beyond “All of the Above” surveys
Not enough resources, wrong resources
Enthusiastic access to content and people
Bad design – starting with the wrong type of taxonomy
Categorization is not library science
  – More like cognitive anthropology

Semantic Infrastructure Development
Conclusion
Text Analytics is the Foundation for Semantic infrastructure
Evaluation of Text Analytics – different than IT software
  – POC – essential, foundation of development
  – Difference of taxonomy and categorization
      Concepts vs. text in documents
Enterprise Context – strategic, self-knowledge
  – Infrastructure resource, not a project
  – Interdisciplinary Team and applications
Integration with other initiatives and technologies
  – Text Mining, Data Mining, Sentiment & beyond, Everything!

Questions?
Tom Reamy
KAPS Group
Knowledge Architecture Professional Services