
1 Semantic Infrastructure Workshop Development
Tom Reamy
Chief Knowledge Architect, KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com

2 Agenda
• Text Analytics
  – Foundation
  – Features and Capabilities
• Evaluation of Text Analytics
  – Start with Self-Knowledge
  – Features and Capabilities
  – Filter, Proof of Concept / Pilot
• Text Analytics Development
  – Progressive Refinement
  – Categorization, Extraction, Sentiment
  – Case Studies
  – Best Practices

3 Semantic Infrastructure - Foundation
Text Analytics Features
• Noun Phrase Extraction (see the sketch below)
  – Catalogs with variants, rule-based dynamic extraction
  – Multiple types, custom classes – entities, concepts, events
  – Feeds facets
• Summarization
  – Customizable rules, mapped to different content
• Fact Extraction
  – Relationships of entities – people-organizations-activities
  – Ontologies – triples, RDF, etc.
• Sentiment Analysis
  – Rules
  – Objects and phrases – positive and negative
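The catalog-with-variants approach can be illustrated with a short sketch. This is a minimal, hypothetical example, not any vendor's API: the catalog entries, entity classes, and helper function are invented for illustration.

```python
# Minimal sketch: catalog-based entity extraction with variants.
# Catalog entries and names are hypothetical, not a vendor's API.
import re
from collections import defaultdict

# A catalog maps (canonical name, entity class) to its surface variants.
CATALOG = {
    ("IBM", "organization"): ["IBM", "International Business Machines"],
    ("SAS", "organization"): ["SAS", "SAS Institute"],
    ("Tom Reamy", "person"): ["Tom Reamy", "T. Reamy"],
}

def extract_entities(text):
    """Return {entity class: set of canonical names found} - this feeds facets."""
    facets = defaultdict(set)
    for (canonical, entity_class), variants in CATALOG.items():
        for variant in variants:
            # Word boundaries so "SAS" does not fire inside another token.
            if re.search(r"\b" + re.escape(variant) + r"\b", text):
                facets[entity_class].add(canonical)
                break
    return dict(facets)

print(extract_entities("T. Reamy compared SAS Institute and IBM in the POC."))
# {'person': {'Tom Reamy'}, 'organization': {'SAS', 'IBM'}} (set order varies)
```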

4 Semantic Infrastructure - Foundation
Text Analytics Features
• Auto-categorization
  – Training sets – Bayesian, vector space
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (title, body, URL)
  – Semantic network – predefined relationships, sets of rules
  – Boolean – full search syntax – AND, OR, NOT (see the rule sketch below)
  – Advanced – NEAR (#), PARAGRAPH, SENTENCE
• This is the most difficult capability to develop
• Build on a taxonomy
• Combine with extraction
  – e.g., fire if any of a list of entities appears along with other words
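As a sketch of how Boolean categorization rules with a proximity operator can be evaluated: the operator names mirror the slide (AND, OR, NOT, NEAR), while the rule encoding, tokenizer, and example category are assumptions made for illustration.

```python
# Minimal sketch of Boolean categorization rules with a NEAR operator,
# assuming a simple whitespace tokenizer. The rule format is hypothetical.

def tokenize(text):
    return text.lower().split()

def near(tokens, a, b, distance):
    """True if terms a and b occur within `distance` tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def matches(tokens, rule):
    op = rule[0]
    if op == "TERM":
        return rule[1] in tokens
    if op == "AND":
        return all(matches(tokens, r) for r in rule[1:])
    if op == "OR":
        return any(matches(tokens, r) for r in rule[1:])
    if op == "NOT":
        return not matches(tokens, rule[1])
    if op == "NEAR":                      # NEAR(term1, term2, #distance)
        return near(tokens, rule[1], rule[2], rule[3])
    raise ValueError(f"unknown operator {op}")

# Hypothetical category rule: billing AND (dispute OR refund) AND
# "charge" within 4 tokens of "wrong".
rule = ("AND",
        ("TERM", "billing"),
        ("OR", ("TERM", "dispute"), ("TERM", "refund")),
        ("NEAR", "charge", "wrong", 4))

doc = "Customer says the wrong charge appeared on the billing statement, wants a refund"
print(matches(tokenize(doc), rule))   # True
```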

5-13 (no transcript text for these slides)

14 Semantic Infrastructure - Foundation
Vendors of Taxonomy / Text Analytics Software
– Attensity
– SAP – Business Objects – Inxight
– Clarabridge
– ClearForest
– Concept Searching
– Data Harmony / Access Innovations
– Expert Systems
– GATE (open source)
– IBM Content Analyst
– Lexalytics
– Multi-Tes
– Nstein
– SAS – Teragram
– SchemaLogic
– Smart Logic
– Synaptica
– Ontology vendors

15 Semantic Infrastructure - Foundation
Varieties of Taxonomy / Text Analytics Software
• Taxonomy Management – Synaptica, SchemaLogic
• Full Platform – SAP-Inxight, ClearForest, SAS-Teragram, Data Harmony, Concept Searching, IBM
• Content Management – Nstein, Interwoven, Documentum, etc.
• Embedded – Search – FAST, Autonomy, Endeca, Exalead, etc.
• Specialty – Sentiment Analysis – Lexalytics

16 Evaluating Taxonomy / Text Analytics Software
Start with Self-Knowledge
• Strategic and business context
• Information problems – what they are, how severe
• Strategic questions – why, what value you expect from the taxonomy / text analytics, how you are going to use it
• Formal process – knowledge architecture (KA) audit – content, users, technology, business and information behaviors, applications
  – Or an informal process for a smaller organization
• Text analytics strategy / model – forms, technology, people
  – Existing taxonomic resources, software
• You need this foundation both to evaluate and to develop

17 Evaluating Taxonomy / Text Analytics Software
Start with Self-Knowledge
• Do you need it – and if so, what blend?
• Stand-alone taxonomy management – multiple taxonomies, languages, authors-editors
• Technology environment – ECM, enterprise search – where is it embedded?
• Publishing process – where and how is metadata being added – now and in the projected future
  – Can it utilize auto-categorization, entity extraction, summarization?
• Is the current search adequate – can it utilize text analytics?
• Applications – text mining, BI, CI, alerts?

18 Semantic Infrastructure - Foundation
Design of the Text Analytics Selection Team
• Interdisciplinary team, led by information professionals
  – IT – software experience, budget, support for tests
  – Business – understands the business and the requirements
  – Library – understands information structure, search semantics, and functionality
• Much more likely to make a good decision
  – This is not a traditional IT software evaluation – it is about semantics
• Creates the foundation for implementation

19 Semantic Infrastructure - Foundation
Evaluating Text Analytics Software – Process
• Start with self-knowledge
• Eliminate the unfit
  – Filter one – ask experts – reputation, research (Gartner, etc.), market strength of the vendor, platforms, etc.
    - Feature scorecard – minimum and must-have features – filter to the top 3-4 (see the scorecard sketch below)
  – Filter two – technology filter – match to your overall scope and capabilities – a filter, not a focus
• Filter three – focus group, one-day visit – 3-4 vendors
• Deep pilot (2) / POC – advanced features, integration, semantics
  – Two questions – who is better, and can it be done, and for how much?
• Focus on the working relationship with the vendor.
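A minimal sketch of the feature-scorecard filter, with hypothetical vendors, features, and weights: must-have features eliminate a vendor outright, and weighted scores rank the rest so you can cut to the top 3-4.

```python
# Minimal scorecard sketch. Vendor names, feature flags, and weights are
# hypothetical, for illustration only.

MUST_HAVE = {"auto_categorization", "entity_extraction"}
WEIGHTS = {"sentiment": 3, "summarization": 2, "multilingual": 2, "api_sdk": 3}

vendors = {
    "Vendor A": {"auto_categorization", "entity_extraction", "sentiment", "api_sdk"},
    "Vendor B": {"auto_categorization", "summarization", "multilingual"},
    "Vendor C": {"auto_categorization", "entity_extraction", "multilingual", "api_sdk"},
}

def scorecard(vendors):
    ranked = []
    for name, features in vendors.items():
        if not MUST_HAVE <= features:     # missing a must-have: filtered out
            continue
        score = sum(w for f, w in WEIGHTS.items() if f in features)
        ranked.append((score, name))
    return sorted(ranked, reverse=True)

for score, name in scorecard(vendors):
    print(f"{name}: {score}")
# Vendor A: 6, Vendor C: 5; Vendor B is filtered out (no entity extraction)
```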

20 Semantic Infrastructure - Foundation
Evaluating Taxonomy Software – POC
• Quality of results is the essential factor
• Six-week POC – a bake-off / or a short pilot
• Real-life scenarios, categorization with your own content
• Preparation:
  – Preliminary analysis of content and users' information needs
  – Set up the software in a lab – relatively easy
  – Train taxonomist(s) on the software
  – Develop a taxonomy if none is available
• Six weeks – 3 rounds of develop, test, refine / not out of the box
• Need SMEs as test evaluators – also to do an initial categorization of the content

21 Semantic Infrastructure - Foundation
Evaluating Taxonomy Software – POC
• Scenarios – categorization, extraction, summarization, etc.
• The majority of the time goes to auto-categorization
• Need to balance uniformity of results against vendor-unique capabilities – this has to be determined at POC time
• Elements:
  – Content
  – Search terms / search scenarios
  – Training and test sets
• Taxonomy developers – expert consultants plus internal taxonomists
• Evaluate usability in action by the taxonomists

22 Semantic Infrastructure - Foundation
Evaluating Taxonomy Software – POC Issues
• Quality of the content
• Quality of the initial human categorization
• Normalizing among different test evaluators
• Quality of the taxonomists – experience with text analytics software and/or experience with the content and information needs and behaviors
• Quality of the taxonomy
  – General issues – structure (too flat or too deep)
  – Overlapping categories
  – Differences in use – browse, index, categorize
• For categorization, the essential issue is the complexity of language
• For entity extraction, the essential issues are scale and disambiguation

23 Semantic Infrastructure - Foundation
Case Study: Telecom Service
Selection criteria:
• Company history, reputation
• Full platform – categorization, extraction, sentiment
• Integration – Java, API/SDK, Linux
• Multiple languages
• Scale – millions of documents a day
• Total cost of ownership
• Ease of development – new
• Vendor relationship – OEM, etc.
Vendors considered:
• Expert Systems
• IBM
• SAS – Teragram
• Smart Logic
• Option – multiple vendors – sentiment & platform
• IBM and SAS – the finalists

24 Semantic Infrastructure - Foundation
POC Design Discussion: Evaluation Criteria
• Basic test design – categorize a test set
  – Score – by file name, against human testers (see the scoring sketch below)
• Categorization – call motivation
  – Accuracy level – 80-90%
  – Effort level per accuracy level
• Sentiment analysis
  – Accuracy level – 80-90%
  – Effort level per accuracy level
• Quantify development time – main elements
• Comparison of the two vendors – how to score?
  – A combination of scores and a report
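Since the basic test design scores system output against human categorization by file name, here is a minimal scoring sketch. The file names, categories, and figures are hypothetical, not the actual POC data.

```python
# Minimal sketch: per-category precision/recall of system output against
# human judgments, keyed by file name. All data here is hypothetical.

human = {"call1.txt": "billing", "call2.txt": "outage", "call3.txt": "billing",
         "call4.txt": "cancel",  "call5.txt": "billing"}
system = {"call1.txt": "billing", "call2.txt": "billing", "call3.txt": "billing",
          "call4.txt": None,      "call5.txt": "billing"}  # None = uncategorized

def score(category):
    assigned = {f for f, c in system.items() if c == category}
    relevant = {f for f, c in human.items() if c == category}
    correct = len(assigned & relevant)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

p, r = score("billing")
print(f"billing: precision={p:.0%} recall={r:.0%}")  # precision=75% recall=100%
```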

25 Text Analytics POC Outcomes
Categorization Results

                          SAS     IBM
  Recall – Motivation     92.6    90.7
  Recall – Actions        93.8    88.3
  Precision – Motivation  84.3
  Precision – Actions     100
  Uncategorized           87.5
  Raw Precision           73      46

26 Text Analytics POC Outcomes
Vendor Comparisons
• Categorization results – both good, with the edge to SAS on precision
  – Use of relevancy scores to set thresholds
• Development environment
  – IBM, as a toolkit, provides more flexibility, but it also increases development effort
• Methodology
  – IBM enforces a good method, but it takes more time
  – SAS can be used in exactly the same way
• SAS has a much more complete set of operators – NOT, DIST, START

27 Text Analytics POC Outcomes
Vendor Comparisons – Functionality
• Sentiment analysis
  – SAS has a workbench; IBM would require more development
  – SAS also has statistical modeling capabilities
• Entity and fact extraction – basically the same
  – SAS can use operators for improved disambiguation
• Summarization
  – SAS has it built in
  – IBM could develop it using categorization rules – but it is not clear that would be as effective without the operators
• Conclusion: both can do the job, with the edge to SAS
• Now the fun begins – development

28 Text Analytics Development: Foundation
• Articulated information management strategy (knowledge map)
  – Content, structures, and metadata
  – Search, ECM, applications – and how they are used in the enterprise
  – Community information needs and the text analytics team
• The POC establishes the preliminary foundation
  – Need to expand and deepen it
  – Content – the full range, as the basis for rules and training
  – Additional SMEs – content selection, refinement
• Taxonomy – the starting point for categorization / is it suitable?
• Databases – the starting point for entity catalogs (see the sketch below)
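As a sketch of seeding an entity catalog from an existing database: the table, columns, and rows below are hypothetical, and the catalog shape matches the extraction sketch on slide 3.

```python
# Minimal sketch: turn a database table into a starting-point entity
# catalog with variants. Table and data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, alias TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Acme Corporation", "Acme"), ("Globex Inc", "Globex")])

# Each canonical name plus its known aliases becomes one catalog entry,
# in the same (canonical, class) -> variants shape used for extraction.
catalog = {}
for name, alias in conn.execute("SELECT name, alias FROM customers"):
    variants = [name] + ([alias] if alias else [])
    catalog[(name, "customer")] = variants

print(catalog)
# {('Acme Corporation', 'customer'): ['Acme Corporation', 'Acme'], ...}
```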

29 Text Analytics Development
Enterprise Environment – Case Studies
• A tale of two taxonomies
  – It was the best of times, it was the worst of times
• Basic approach:
  – Initial meetings – project planning
  – High-level knowledge map – content, people, technology
  – Contextual and information interviews
  – Content analysis
  – Draft taxonomy – validation interviews, refine
  – Integration and governance plans

30 Text Analytics Development
Enterprise Environment – Case One – Taxonomy, 7 Facets
• Taxonomy of subjects / disciplines:
  – Science > Marine Science > Marine Microbiology > Marine Toxins
• Facets (see the sketch below):
  – Organization > Division > Group
  – Clients > Federal > EPA
  – Instruments > Environmental Testing > Ocean Analysis > Vehicle
  – Facilities > Division > Location > Building X
  – Methods > Social > Population Study
  – Materials > Compounds > Chemicals
  – Content Type > Knowledge Asset > Proposals
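To make the facet idea concrete, here is a minimal sketch of representing facet paths like those above and attaching them to a document as multi-dimensional metadata. The facet paths mirror the slide; the document assignments are hypothetical.

```python
# Minimal sketch: facets as independent hierarchies, with a document tagged
# by a value in each relevant facet. Document assignments are hypothetical.

FACETS = {
    "Subjects":     ("Science", "Marine Science", "Marine Microbiology", "Marine Toxins"),
    "Clients":      ("Clients", "Federal", "EPA"),
    "Content Type": ("Content Type", "Knowledge Asset", "Proposals"),
}

def render(path):
    """Render a facet path in the 'A > B > C' style used on the slide."""
    return " > ".join(path)

# Unlike a single taxonomy tree, a document carries one value per facet,
# so retrieval can slice the collection along independent dimensions.
doc_tags = {"proposal-042.doc": ["Subjects", "Clients", "Content Type"]}

for doc, facets in doc_tags.items():
    for facet in facets:
        print(f"{doc}: {facet} = {render(FACETS[facet])}")
```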

31 Text Analytics Development
Enterprise Environment – Case One – Taxonomy, 7 Facets
• Project owner – the KM department – included RM, business process
• Involvement of the library – critical
• Realistic budget, flexible project plan
• Successful interviews – built on context
  – Overall information strategy – where the taxonomy fits
• Good draft taxonomy and extended refinement
  – Software, process, team – trained the library staff
  – Good selection and number of facets
• Final plans and hand-off to the client

32 Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 Facets
• Taxonomy of subjects / disciplines:
  – Geology > Petrology
• Facets:
  – Organization > Division > Group
  – Process > Drill a Well > File Test Plan
  – Assets > Platforms > Platform A
  – Content Type > Communication > Presentations

33 Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 Facets
• Environment issues
  – Value of the taxonomy understood, but not its complexity and scope
  – Under-budgeted, under-staffed
  – Location – not KM – tied to RM and software: a solution looking for the right problem
  – Importance of an internal library staff
  – Difficulty of merging internal expertise and taxonomy expertise

34 Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 Facets
• Project issues
  – Project mindset – not infrastructure
  – Wrong kind of project management
    - Special needs of a taxonomy project
    - Importance of integration – with the team, with the company
  – Project plan treated as more important than results
    - Rushing to meet deadlines doesn't work with semantics as well as it does with software

35 Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 Facets
• Research issues
  – Not enough research – and the wrong people
  – Interference of non-taxonomy concerns – communication
  – Misunderstanding of the research – wanted tinker-toy connections
    - Interview 1 implies conclusion A
• Design issues
  – Not enough facets
  – Wrong set of facets – business-oriented, not information-oriented
  – Ill-defined facets – too complex an internal structure

36 Text Analytics Development
Conclusion: Risk Factors
• The political-cultural-semantic environment
  – Not simple resistance – more subtle: re-interpretation of specific conclusions and of the sequence of conclusions / the relative importance of specific recommendations
• Understanding the project scope
• Access to content and people
  – Enthusiastic access
• Importance of a unified project team
  – Working communication as well as weekly meetings

37 Text Analytics Development
Case Study 2 – POC – Telecom Client
• Demo of SAS – Teragram / Enterprise Content Categorization

38 Text Analytics Development
Best Practices – Principles
• Importance of ongoing maintenance and refinement
• Need a dedicated taxonomy team working with SMEs
• Work with application developers to incorporate text analytics into new applications
• Importance of metrics and feedback – software and social
• Questions:
  – What are the important subjects (and how are they changing)?
  – What information do users need?
  – How is their information related to other silos?

39 Text Analytics Development
Best Practices – Principles
• Process
  – Realistic budget – not a nice-to-have add-on
  – Flexible project plan – semantics are complex and messy
    - Time estimates are difficult; objective success measures are too
  – The transition from development to maintenance is fluid
• Resources
  – An interdisciplinary team is essential
  – Importance of communication – languages
  – Merging internal and external expertise

40 Text Analytics Development
Best Practices – Principles
• Categorization taxonomy structure
  – Trade-off between depth and complexity of rules
  – Multiple avenues – facets, terms, rules, etc. – there is no one right balance
  – The recall-precision balance is application-specific
  – Training sets are starting points; rules rule
  – Need for custom development
• Technology
  – Basic integration – XML (see the sketch below)
  – Advanced – combine unstructured and structured data in new ways
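A minimal sketch of the basic XML integration point: categorization and extraction results serialized as metadata for a downstream CMS or search index. The element names are illustrative, not a specific vendor's schema.

```python
# Minimal sketch: serialize categorization results as XML metadata.
# Element and attribute names are hypothetical.
import xml.etree.ElementTree as ET

def to_xml(doc_id, categories, entities):
    doc = ET.Element("document", id=doc_id)
    cats = ET.SubElement(doc, "categories")
    for name, confidence in categories:
        ET.SubElement(cats, "category", confidence=str(confidence)).text = name
    ents = ET.SubElement(doc, "entities")
    for entity_class, value in entities:
        ET.SubElement(ents, "entity", type=entity_class).text = value
    return ET.tostring(doc, encoding="unicode")

print(to_xml("call1.txt",
             categories=[("billing", 0.92)],
             entities=[("organization", "Acme")]))
# <document id="call1.txt"><categories><category confidence="0.92">billing
# </category></categories><entities>...</entities></document>
```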

41 Text Analytics Development
Best Practices – Risk Factors
• Value understood, but not the complexity and scope
• Project mindset – a software project, and then you're done
• Not enough research on users' information needs and behaviors
  – Talking to the right people and asking the right questions
  – Getting beyond "all of the above" surveys
• Not enough resources, or the wrong resources
• Lack of enthusiastic access to content and people
• Bad design – starting with the wrong type of taxonomy
• Categorization is not library science
  – More like cognitive anthropology

42 Semantic Infrastructure Development
Conclusion
• Text analytics is the foundation for semantic infrastructure
• Evaluation of text analytics – different from IT software evaluation
  – POC – essential, the foundation of development
  – The difference between taxonomy and categorization
    - Concepts vs. text in documents
• Enterprise context – strategic, self-knowledge
  – An infrastructure resource, not a project
  – Interdisciplinary team and applications
• Integration with other initiatives and technologies
  – Text mining, data mining, sentiment & beyond – everything!

43 Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group: Knowledge Architecture Professional Services
http://www.kapsgroup.com

