Semantic Infrastructure Workshop: Development
Tom Reamy, Chief Knowledge Architect
KAPS Group – Knowledge Architecture Professional Services
Agenda
– Text Analytics
  – Foundation
  – Features and Capabilities
– Evaluation of Text Analytics
  – Start with Self-Knowledge
  – Features and Capabilities
  – Filter, Proof of Concept / Pilot
– Text Analytics Development
  – Progressive Refinement
  – Categorization, Extraction, Sentiment
  – Case Studies
  – Best Practices
Semantic Infrastructure – Foundation: Text Analytics Features
– Noun Phrase Extraction
  – Catalogs with variants, plus rule-based dynamic extraction (sketched below)
  – Multiple types and custom classes – entities, concepts, events
  – Feeds facets
– Summarization
  – Customizable rules, mapped to different content
– Fact Extraction
  – Relationships of entities – people-organizations-activities
  – Ontologies – triples, RDF, etc.
– Sentiment Analysis
  – Rules
  – Objects and phrases – positive and negative
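To make the catalog idea concrete, here is a minimal Python sketch of catalog-based entity extraction with variants. The catalog entries and the sample sentence are illustrative, not drawn from any vendor's product.

```python
import re

# Hypothetical mini-catalog: canonical entity name -> surface variants.
CATALOG = {
    "International Business Machines": ["IBM", "International Business Machines"],
    "SAS Institute": ["SAS", "SAS Institute"],
}

def extract_entities(text):
    """Return (canonical_name, matched_variant, position) for each catalog hit."""
    hits = []
    for canonical, variants in CATALOG.items():
        for variant in variants:
            for m in re.finditer(r"\b" + re.escape(variant) + r"\b", text):
                hits.append((canonical, variant, m.start()))
    return hits

print(extract_entities("The POC compared IBM and SAS on the same content."))
```

A production catalog would also normalize overlapping variants and feed the resulting entities into facets.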
Semantic Infrastructure – Foundation: Text Analytics Features
– Auto-categorization
  – Training sets – Bayesian, vector space
  – Terms – literal strings, stemming, dictionaries of related terms
  – Rules – simple – position in text (title, body, URL)
  – Semantic networks – predefined relationships, sets of rules
  – Boolean – full search syntax – AND, OR, NOT
  – Advanced – NEAR (#), PARAGRAPH, SENTENCE (see the sketch below)
– This is the most difficult capability to develop
– Build on a taxonomy
– Combine with extraction – e.g., match if any of a list of entities occurs with other qualifying words
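A minimal sketch of how a Boolean categorization rule with a NEAR-style proximity operator might behave, assuming simple whitespace tokenization; the rule, category, and sample text are invented, and commercial rule languages are far richer.

```python
def near(tokens, term_a, term_b, max_distance=5):
    """True if term_a and term_b occur within max_distance tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t.lower() == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t.lower() == term_b]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)

def matches_billing_dispute(text):
    # Hypothetical rule: (billing NEAR/5 dispute) AND NOT resolved
    tokens = text.split()
    return near(tokens, "billing", "dispute") and "resolved" not in text.lower()

print(matches_billing_dispute("Customer called about a billing dispute on the invoice"))  # True
```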
Evaluating Text Analytics Software: Start with Self-Knowledge
– Strategic and business context
  – Information problems – what they are, how severe
  – Strategic questions – why, what value you expect from the taxonomy / text analytics, how you are going to use it
– Formal process – a knowledge architecture audit: content, users, technology, business and information behaviors, applications – or an informal version for a smaller organization
– Text analytics strategy/model – forms, technology, people
  – Existing taxonomic resources and software
– You need this foundation both to evaluate and to develop
Evaluating Text Analytics Software: Start with Self-Knowledge
– Do you need it – and if so, what blend?
– Stand-alone taxonomy management – multiple taxonomies, languages, authors-editors
– Technology environment – ECM, enterprise search – where is it embedded?
– Publishing process – where and how is metadata being added, now and in the projected future?
  – Can it utilize auto-categorization, entity extraction, summarization?
– Is the current search adequate – can it utilize text analytics?
– Applications – text mining, BI, CI, alerts?
Evaluating Text Analytics Software: Team – Interdisciplinary
– IT – experienced with large software purchases and needs assessments
  – But text analytics is different – it is about semantics
  – Like a construction company designing your house
– Business – understands the business needs
  – But doesn't understand information
  – Like a restaurant owner doing the cooking
– Library – knows information and search
  – But doesn't understand the business; business users are non-information experts
  – Like an accountant doing financial strategy
– Team – a combination of consulting and internal staff
Semantic Infrastructure – Foundation: Design of the Text Analytics Selection Team
– An interdisciplinary team, led by information professionals
  – IT – software experience, budget, support for tests
  – Business – understands the business and the requirements
  – Library – understands information structure, search semantics, and functionality
– Much more likely to make a good decision
  – This is not a traditional IT software evaluation – it is about semantics
– Creates the foundation for implementation
Evaluating Text Analytics Software: Evaluation Process and Methodology – Two Phases
– Phase I – traditional software evaluation
  – Filter One – ask the experts – reputation, research (Gartner, etc.); market strength of the vendor, platforms, etc.
  – Filter Two – feature scorecard – minimum and must-have features; filter to the top 3 (see the sketch below)
  – Filter Three – technology filter – match to your overall scope and capabilities – a filter, not a focus
  – Filter Four – in-depth demos – 3-6 vendors
– Phase II – deep POC with two finalists – advanced features, integration, semantics
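A minimal sketch of a Filter Two feature scorecard; the features, weights, and vendor ratings below are made up for illustration.

```python
# Hypothetical feature weights (how much each capability matters to us).
WEIGHTS = {"categorization": 5, "entity_extraction": 4, "sentiment": 3, "multi_language": 2}

# Hypothetical 1-5 ratings per vendor, from demos and documentation.
vendors = {
    "Vendor A": {"categorization": 4, "entity_extraction": 3, "sentiment": 5, "multi_language": 2},
    "Vendor B": {"categorization": 5, "entity_extraction": 4, "sentiment": 2, "multi_language": 4},
}

def score(ratings):
    """Weighted sum of feature ratings."""
    return sum(WEIGHTS[feature] * rating for feature, rating in ratings.items())

shortlist = sorted(vendors, key=lambda v: score(vendors[v]), reverse=True)
print([(v, score(vendors[v])) for v in shortlist])  # rank, then keep the top vendors
```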
Evaluating Text Analytics Software: Phase II – Proof of Concept (POC)
– A 4-6 week POC – a bake-off, or a short pilot
– Measurable quality of results is the essential factor
  – Real-life scenarios, categorization with your own content
  – 2-3 rounds of develop-test-refine – not out-of-the-box
– Need SMEs as test evaluators – also to do an initial categorization of the content
– The majority of the time goes to auto-categorization
– Need to balance uniformity of results with vendor-unique capabilities – this has to be determined at POC time
– Taxonomy developers – expert consultants plus internal taxonomists
Evaluating Text Analytics Software: Phase II – POC: Range of Evaluations
– The basic question – can this stuff work at all?
– Auto-categorization against an existing taxonomy – a variety of content
  – The essential issue is the complexity of language
– Clustering – automatic node generation
– Summarization
– Entity extraction – build a number of catalogs, chosen by projected needs – for example, privacy information (SSN, phone, etc.; see the sketch below)
  – Entity examples – people, organizations, methods, etc.
  – The essential issues are scale and disambiguation
– Evaluate usability in action by taxonomists
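A minimal sketch of privacy-information extraction with regular expressions, assuming US-style SSN and phone formats; a real catalog would also validate hits and disambiguate (phone numbers vs. part numbers, for instance).

```python
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def extract_privacy_info(text):
    """Map each privacy-entity type to the matches found in the text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

print(extract_privacy_info("Call 555-867-5309; the SSN on file is 078-05-1120."))
```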
Text Analytics Evaluation: Case Study – Self-Knowledge
– Platform – a range of capabilities – categorization, sentiment analysis, etc.
– Technical – APIs, Java-based, Linux runtime
  – Scalability – millions of documents a day
  – Import/export – XML, RDF
– Total cost of ownership
– Vendor relationship – OEM
– Usability, multiple-language support
– Team – 3 from KAPS (information) and 5-8 from Amdocs (SMEs – business, technical)
Text Analytics Evaluation: Case Study – Phase I
Vendors evaluated:
– Attensity
– SAP – Inxight
– Clarabridge
– ClearForest
– Concept Searching
– Data Harmony / Access Innovations
– Expert Systems
– GATE (open source)
– IBM
– Lexalytics
– Multi-Tes
– Nstein
– SAS
– SchemaLogic
– Smart Logic
Adjacent categories: content management, enterprise search, sentiment analysis specialty, ontology platforms
Text Analytics Evaluation: Case Study – Telecom Service Company
Selection criteria:
– History, reputation
– Full platform – categorization, extraction, sentiment
– Integration – Java, API/SDK, Linux
– Multiple languages
– Scale – millions of documents a day
– Total cost of ownership
– Ease of development – a new criterion
– Vendor relationship – OEM, etc.
Shortlist: Expert Systems, IBM, SAS (Teragram), Smart Logic
– Option – multiple vendors – sentiment plus platform
– IBM and SAS were the finalists
Text Analytics Evaluation: Case Study – POC Design Discussion: Evaluation Criteria
– Basic test design – categorize a test set
  – Scored by file name, with human testers (see the scoring sketch below)
– Categorization – call motivation
  – Accuracy level – 80-90%
  – Effort level per accuracy level
– Sentiment analysis
  – Accuracy level – 80-90%
  – Effort level per accuracy level
– Quantify development time – the main elements
– Comparing the two vendors – how to score?
  – A combination of scores and a report
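A minimal sketch of the scoring step: compare engine output to an SME "gold" categorization keyed by file name. The file names, categories, and results here are illustrative.

```python
# SME gold labels and hypothetical engine output, keyed by file name.
gold = {"call_001.txt": "billing", "call_002.txt": "outage", "call_003.txt": "billing"}
engine = {"call_001.txt": "billing", "call_002.txt": "billing"}  # call_003 left uncategorized

categorized = [f for f in gold if f in engine]
correct = [f for f in categorized if engine[f] == gold[f]]

precision = len(correct) / len(categorized)   # of what it tagged, how much was right
recall = len(correct) / len(gold)             # of all documents, how many were tagged correctly
uncategorized = 1 - len(categorized) / len(gold)

print(f"precision={precision:.1%}  recall={recall:.1%}  uncategorized={uncategorized:.1%}")
```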
Text Analytics Evaluation: Case Study – Phase II POC: Risks
– The CIO/CTO problem – this is not a regular software process
  – Language is messy, not just complex – 30% accuracy isn't 30% done; it could be 90% done
  – Variability of human categorization and expression – even among professional writers (journalists, for example)
– Categorization is iterative, not a matter of "the program works"
  – Need a realistic budget and a flexible project plan
– "Anyone can do categorization"
  – Librarians often overdo it; SMEs often get lost (keywords)
– Meta-language issues – understanding the results
  – Need to educate IT and business in their own language
Text Analytics POC Outcomes: Categorization Results (SAS vs. IBM)
– Recall – Motivation
– Recall – Actions
– Precision – Motivation: 84.3
– Precision – Actions: 100
– Uncategorized: 87.5
– Raw Precision
Text Analytics POC Outcomes: Vendor Comparisons
– Categorization results – both good; edge to SAS on precision
  – Use of relevancy scores to set thresholds
– Development environment – IBM, as a toolkit, provides more flexibility, but it also increases development effort
– Methodology – IBM enforces a good method, but takes more time
  – SAS can be used in exactly the same way
– SAS has a much more complete set of operators – NOT, DIST, START (a DIST-style check is sketched below)
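To illustrate what a distance operator buys you, here is a generic Python re-implementation of a DIST-style check (all terms within a window of n tokens, in any order); this is not SAS's or IBM's actual rule syntax, and the terms and sample sentence are invented.

```python
from itertools import product

def dist(tokens, terms, n):
    """True if every term occurs and all of them fit within a window of n tokens."""
    positions = []
    for term in terms:
        hits = [i for i, t in enumerate(tokens) if t.lower() == term]
        if not hits:
            return False
        positions.append(hits)
    # Try every combination of one occurrence per term.
    return any(max(combo) - min(combo) <= n for combo in product(*positions))

tokens = "the customer wants to cancel the premium service plan".split()
print(dist(tokens, ["cancel", "service"], 4))  # True: the terms are 3 tokens apart
```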
Text Analytics POC Outcomes: Vendor Comparisons – Functionality
– Sentiment analysis
  – SAS has a workbench; IBM would require more development
  – SAS also has statistical modeling capabilities
– Entity and fact extraction – seems basically the same
  – SAS can use operators for improved disambiguation
– Summarization
  – SAS has it built in
  – IBM could develop it using categorization rules – but it is not clear that would be as effective without the operators
– Conclusion: both can do the job; edge to SAS
– Now the fun begins – development
Text Analytics Development: Foundation
– An articulated information management strategy (knowledge map)
  – Content, structures, and metadata
  – Search, ECM, applications – and how they are used across the enterprise
  – Community information needs
– The text analytics team
– The POC establishes the preliminary foundation – it needs to be expanded and deepened
  – Content – the full range, as the basis for rules and training
  – Additional SMEs – for content selection and refinement
– Taxonomy – is the starting point for categorization suitable?
– Databases – the starting point for entity catalogs
Text Analytics Development: Enterprise Environment – Case Studies
– A tale of two taxonomies – it was the best of times, it was the worst of times
– Basic approach:
  – Initial meetings – project planning
  – High-level knowledge map – content, people, technology
  – Contextual and information interviews
  – Content analysis
  – Draft taxonomy – validation interviews, refinement
  – Integration and governance plans
Text Analytics Development: Enterprise Environment – Case One – Taxonomy Plus 7 Facets
– Taxonomy of subjects/disciplines:
  – Science > Marine Science > Marine Microbiology > Marine Toxins
– Facets (a data-structure sketch follows):
  – Organization > Division > Group
  – Clients > Federal > EPA
  – Instruments > Environmental Testing > Ocean Analysis > Vehicle
  – Facilities > Division > Location > Building X
  – Methods > Social > Population Study
  – Materials > Compounds > Chemicals
  – Content Type > Knowledge Asset > Proposals
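A minimal sketch of how faceted metadata like this might be represented: each document carries one path per facet. The facet names follow the Case One example; the document and its values are hypothetical.

```python
# One hypothetical document's faceted metadata: facet name -> path.
doc_facets = {
    "Subjects": ["Science", "Marine Science", "Marine Microbiology", "Marine Toxins"],
    "Clients": ["Federal", "EPA"],
    "Facilities": ["Division", "Location", "Building X"],
    "Content Type": ["Knowledge Asset", "Proposals"],
}

def facet_path(facets, name):
    """Render one facet as a ' > ' delimited path string."""
    return " > ".join(facets.get(name, []))

print(facet_path(doc_facets, "Subjects"))
# Science > Marine Science > Marine Microbiology > Marine Toxins
```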
Text Analytics Development: Enterprise Environment – Case One – Taxonomy Plus 7 Facets
– Project owner – the KM department – which included RM and business process
– Involvement of the library – critical
– A realistic budget and a flexible project plan
– Successful interviews – built on context
  – An overall information strategy – where the taxonomy fits
– A good draft taxonomy and extended refinement
  – Software, process, team – trained the library staff
  – A good selection and number of facets
– Final plans and hand-off to the client
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy Plus 4 Facets
– Taxonomy of subjects/disciplines:
  – Geology > Petrology
– Facets:
  – Organization > Division > Group
  – Process > Drill a Well > File Test Plan
  – Assets > Platforms > Platform A
  – Content Type > Communication > Presentations
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy Plus 4 Facets
– Environment issues
  – The value of the taxonomy was understood, but not its complexity and scope
  – Under-budgeted and under-staffed
  – Location – not in KM – tied to RM and software
– A solution looking for the right problem
  – The importance of an internal library staff
  – The difficulty of merging internal expertise with taxonomy expertise
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy Plus 4 Facets
– Project issues
  – A project mindset – not an infrastructure mindset
  – The wrong kind of project management
    – The special needs of a taxonomy project
    – The importance of integration – with the team and the company
  – The project plan treated as more important than the results
    – Rushing to meet deadlines doesn't work as well with semantics as it does with software
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy Plus 4 Facets
– Research issues
  – Not enough research – and the wrong people
  – Interference from non-taxonomy concerns – communication
  – Misunderstanding of research – they wanted tinker-toy connections ("interview 1 implies conclusion A")
– Design issues
  – Not enough facets
  – The wrong set of facets – business-oriented, not information-oriented
  – Ill-defined facets – too complex an internal structure
Text Analytics Development: Conclusion – Risk Factors
– The political-cultural-semantic environment
  – Not simple resistance – something more subtle: re-interpretation of specific conclusions and their sequence, and of the relative importance of specific recommendations
– Understanding the project scope
– Access to content and people – it needs to be enthusiastic access
– The importance of a unified project team
  – Working communication, not just weekly meetings
Text Analytics Development: Case Study 2 – POC – Telecom Client
– Demo of SAS Enterprise Content Categorization
Text Analytics Development: Best Practices – Principles
– The importance of ongoing maintenance and refinement
– Need a dedicated taxonomy team working with SMEs
– Work with application developers to incorporate text analytics into new applications
– The importance of metrics and feedback – software and social
– Questions:
  – What are the important subjects (and how are they changing)?
  – What information do users need?
  – How is their information related to other silos?
Text Analytics Development: Best Practices – Principles
– Process
  – A realistic budget – this is not a nice-to-have add-on
  – A flexible project plan – semantics are complex and messy
    – Time estimates are difficult; objective success measures are too
  – The transition from development to maintenance is fluid
– Resources
  – An interdisciplinary team is essential
  – The importance of communication – across the different languages
  – Merging internal and external expertise
Text Analytics Development: Best Practices – Principles
– Categorization taxonomy structure
  – The trade-off between depth and complexity of rules
  – Multiple avenues – facets, terms, rules, etc. – there is no single right balance
  – The recall-precision balance is application-specific
  – Training sets are starting points; rules rule
  – The need for custom development
– Technology
  – Basic integration – XML (see the sketch below)
  – Advanced – combine unstructured and structured data in new ways
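A minimal sketch of basic XML integration: emitting categorization results for a downstream system. The element names and scores are illustrative, not any vendor's schema.

```python
import xml.etree.ElementTree as ET

def to_xml(doc_id, categories):
    """Serialize (category, relevancy) pairs for one document as XML."""
    doc = ET.Element("document", id=doc_id)
    for name, relevancy in categories:
        ET.SubElement(doc, "category", name=name, relevancy=str(relevancy))
    return ET.tostring(doc, encoding="unicode")

print(to_xml("call_001.txt", [("billing", 0.92), ("cancellation", 0.41)]))
# <document id="call_001.txt"><category name="billing" relevancy="0.92" /> ... </document>
```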
Text Analytics Development: Best Practices – Risk Factors
– Value understood, but not the complexity and scope
– A project mindset – a software project, and then you're done
– Not enough research on user information needs and behaviors
  – Talking to the right people and asking the right questions
  – Getting beyond "all of the above" surveys
– Not enough resources, or the wrong resources
– Lack of enthusiastic access to content and people
– Bad design – starting with the wrong type of taxonomy
– Categorization is not library science – it is more like cognitive anthropology
Semantic Infrastructure Development: Conclusion
– Text analytics is the foundation for semantic infrastructure
– Evaluation of text analytics is different from IT software evaluation
  – The POC is essential – it is the foundation of development
  – The difference between taxonomy and categorization: concepts vs. text in documents
– Enterprise context – strategic self-knowledge
  – An infrastructure resource, not a project
  – An interdisciplinary team and applications
– Integration with other initiatives and technologies
  – Text mining, data mining, sentiment and beyond – everything!
Questions?
Tom Reamy
KAPS Group – Knowledge Architecture Professional Services