
1 Semantic Infrastructure Workshop Development
Tom Reamy, Chief Knowledge Architect
KAPS Group: Knowledge Architecture Professional Services
http://www.kapsgroup.com

2 Agenda
• Text Analytics – Foundation
  – Features and Capabilities
• Evaluation of Text Analytics
  – Start with Self-Knowledge
  – Features and Capabilities
  – Filter, Proof of Concept / Pilot
• Text Analytics Development
  – Progressive Refinement
  – Categorization, Extraction, Sentiment
  – Case Studies
  – Best Practices

3 Semantic Infrastructure – Foundation: Text Analytics Features
• Noun Phrase Extraction
  – Catalogs with variants, rule-based dynamic extraction
  – Multiple types, custom classes – entities, concepts, events
  – Feeds facets
• Summarization
  – Customizable rules, mapped to different content
• Fact Extraction
  – Relationships of entities – people-organizations-activities
  – Ontologies – triples, RDF, etc.
• Sentiment Analysis
  – Rules
  – Objects and phrases – positive and negative
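
A minimal sketch of noun phrase and entity extraction using the open-source spaCy library. This is an illustration only, not the API of any of the commercial platforms discussed in the workshop; the model name and sample sentence are assumptions. Fact extraction and RDF triples would build on top of output like this.

```python
import spacy

# Small English pipeline; assumes "en_core_web_sm" has been downloaded.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tom Reamy of KAPS Group presented a text analytics workshop in San Jose.")

# Noun phrase extraction: noun_chunks yields base noun phrases.
for chunk in doc.noun_chunks:
    print("noun phrase:", chunk.text)

# Entity extraction: named entities with predefined classes (PERSON, ORG, GPE, ...).
for ent in doc.ents:
    print("entity:", ent.text, ent.label_)
```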

4 Semantic Infrastructure – Foundation: Text Analytics Features
• Auto-categorization
  – Training sets – Bayesian, vector space
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (title, body, URL)
  – Semantic network – predefined relationships, sets of rules
  – Boolean – full search syntax – AND, OR, NOT
  – Advanced – NEAR (#), PARAGRAPH, SENTENCE
• This is the most difficult capability to develop
• Build on a taxonomy
• Combine with extraction – e.g., fire if any of a list of entities appears together with other words
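
A minimal sketch of how a Boolean categorization rule with a NEAR-style proximity operator might be evaluated. The rule, terms, and distance are hypothetical and the syntax is not that of any particular vendor; it only illustrates the kind of logic behind "full search syntax" rules.

```python
import re

def tokens(text):
    """Lowercased word tokens; a real engine would also stem and expand term variants."""
    return re.findall(r"[a-z0-9']+", text.lower())

def contains(text, term):
    """Literal-string match on token boundaries."""
    return term.lower() in tokens(text)

def near(text, term_a, term_b, distance=10):
    """True if term_a and term_b occur within `distance` tokens of each other (a NEAR(#)-style test)."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == term_b.lower()]
    return any(abs(a - b) <= distance for a in pos_a for b in pos_b)

def billing_complaint_rule(text):
    """Hypothetical rule: (bill OR invoice) AND NEAR(charge, dispute) AND NOT thank."""
    return ((contains(text, "bill") or contains(text, "invoice"))
            and near(text, "charge", "dispute", distance=8)
            and not contains(text, "thank"))

doc = "The customer wants to dispute a duplicate charge on the latest bill."
print(billing_complaint_rule(doc))  # True for this sample document
```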

Slides 5–13: no transcript text.
14 Evaluating Text Analytics Software: Start with Self-Knowledge
• Strategic and Business Context
• Info Problems – what, how severe
• Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it
• Formal Process – KA audit – content, users, technology, business and information behaviors, applications
  – Or an informal process for a smaller organization
• Text Analytics Strategy/Model – forms, technology, people
  – Existing taxonomic resources, software
• Need this foundation both to evaluate and to develop

15 Evaluating Text Analytics Software: Start with Self-Knowledge
• Do you need it – and if so, what blend?
• Taxonomy Management, stand-alone – multiple taxonomies, languages, authors-editors
• Technology Environment – ECM, Enterprise Search – where is it embedded?
• Publishing Process – where and how is metadata being added – now and in the projected future
  – Can it utilize auto-categorization, entity extraction, summarization?
• Is the current search adequate – can it utilize text analytics?
• Applications – text mining, BI, CI, alerts?

16 Evaluating Text Analytics Software: Team – Interdisciplinary
• IT – large software purchase, needs assessment
  – Text analytics is different – semantics
  – Like a construction company designing your house
• Business – understand the business needs
  – Don't understand information
  – Like a restaurant owner doing the cooking
• Library – know information, search
  – Don't understand the business, non-information experts
  – Like an accountant doing financial strategy
• Team – combination of consulting and internal

17 Semantic Infrastructure – Foundation: Design of the Text Analytics Selection Team
• Interdisciplinary team, led by information professionals
  – IT – software experience, budget, support tests
  – Business – understand business and requirements
  – Library – understand information structure, search semantics and functionality
• Much more likely to make a good decision
  – This is not a traditional IT software evaluation – semantics
• Create the foundation for implementation

18 Evaluating Text Analytics Software: Evaluation Process & Methodology – Two Phases
• Phase I – traditional software evaluation
  – Filter One – ask experts – reputation, research – Gartner, etc.; market strength of vendor, platforms, etc.
  – Filter Two – feature scorecard – minimum, must-have; filter to top 3
  – Filter Three – technology filter – match to your overall scope and capabilities – a filter, not a focus
  – Filter Four – in-depth demo – 3–6 vendors
• Phase II – deep POC (2 vendors) – advanced, integration, semantics
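
A minimal sketch of the "feature scorecard" filter: weight each required capability, score each vendor against it, and rank. The feature names, weights, and scores below are hypothetical placeholders, not assessments of real products.

```python
# Weights reflect how important each capability is to this organization (hypothetical).
weights = {
    "auto_categorization": 5,
    "entity_extraction": 4,
    "sentiment_analysis": 3,
    "multi_language": 3,
    "api_integration": 4,
}

# Per-vendor scores on a 0-5 scale (hypothetical).
vendors = {
    "Vendor A": {"auto_categorization": 4, "entity_extraction": 5, "sentiment_analysis": 2,
                 "multi_language": 3, "api_integration": 4},
    "Vendor B": {"auto_categorization": 5, "entity_extraction": 3, "sentiment_analysis": 4,
                 "multi_language": 4, "api_integration": 3},
}

def weighted_score(scores):
    return sum(weights[f] * scores[f] for f in weights)

# Rank vendors by weighted score; the top few move on to the next filter.
for name, scores in sorted(vendors.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(name, weighted_score(scores))
```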

19 Evaluating Text Analytics Software: Phase II – Proof of Concept (POC)
• 4–6 week POC – bake-off, or a short pilot
• Measurable quality of results is the essential factor
• Real-life scenarios, categorization with your content
• 2–3 rounds of development, test, refine – not out-of-the-box
• Need SMEs as test evaluators – also to do an initial categorization of content
• Majority of the time is spent on auto-categorization
• Need to balance uniformity of results with vendor-unique capabilities – have to determine this at POC time
• Taxonomy developers – expert consultants plus internal taxonomists

20 Evaluating Text Analytics Software: Phase II – POC, Range of Evaluations
• Basic question – can this stuff work at all?
• Auto-categorization to an existing taxonomy – variety of content
  – Essential issue is the complexity of language
• Clustering – automatic node generation
• Summarization
• Entity extraction – build a number of catalogs
  – Design which ones based on projected needs – for example, privacy info (SS#, phone, etc.)
  – Entity examples – people, organizations, methods, etc.
  – Essential issue is scale and disambiguation
• Evaluate usability in action by taxonomists
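
A minimal sketch of a pattern-based entity catalog for privacy information, in the spirit of the SS#/phone example above. The patterns are simplified and US-centric assumptions; a production catalog would cover far more variants and handle disambiguation.

```python
import re

# Hypothetical catalogs: entity type -> pattern.
CATALOGS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def extract_entities(text):
    """Return {entity_type: [matches]} for every catalog that fires on the text."""
    return {name: pattern.findall(text) for name, pattern in CATALOGS.items()}

sample = "Call 415-555-0123 about the claim; the SSN on file is 123-45-6789."
print(extract_entities(sample))
# {'ssn': ['123-45-6789'], 'phone': ['415-555-0123']}
```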

21 Text Analytics Evaluation: Case Study – Self-Knowledge
• Platform – range of capabilities
  – Categorization, sentiment analysis, etc.
• Technical – APIs, Java-based, Linux runtime
  – Scalability – millions of documents a day
  – Import-export – XML, RDF
• Total cost of ownership
• Vendor relationship – OEM
• Usability, multiple-language support
• Team – 3 KAPS – information
  – 5–8 Amdocs – SMEs – business, technical

22 Text Analytics Evaluation: Case Study – Phase I
• Vendors considered:
  – Attensity
  – SAP – Inxight
  – Clarabridge
  – ClearForest
  – Concept Searching
  – Data Harmony / Access Innovations
  – Expert Systems
  – GATE (open source)
  – IBM
  – Lexalytics
  – Multi-Tes
  – Nstein
  – SAS
  – SchemaLogic
  – Smart Logic
• Vendor categories – Content Management, Enterprise Search, Sentiment Analysis specialty, Ontology Platforms

23 Text Analytics Evaluation: Case Study – Telecom Service
• Selection criteria:
  – Company history, reputation
  – Full platform – categorization, extraction, sentiment
  – Integration – Java, API-SDK, Linux
  – Multiple languages
  – Scale – millions of docs a day
  – Total cost of ownership
  – Ease of development – new
  – Vendor relationship – OEM, etc.
• Shortlist:
  – Expert Systems
  – IBM
  – SAS – Teragram
  – Smart Logic
• Option – multiple vendors – sentiment & platform
• IBM and SAS – finalists

24 Text Analytics Evaluation: Case Study – POC Design Discussion: Evaluation Criteria
• Basic test design – categorize a test set
  – Score – by file name, human testers
• Categorization – call motivation
  – Accuracy level – 80–90%
  – Effort level per accuracy level
• Sentiment analysis
  – Accuracy level – 80–90%
  – Effort level per accuracy level
• Quantify development time – main elements
• Comparison of the two vendors – how to score?
  – Combination of scores and a report
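
A minimal sketch of the basic test design: compare the engine's category assignment per test file against the human testers' gold assignments and report per-category precision and recall. The file names and categories are hypothetical placeholders.

```python
# Gold labels from SME testers, keyed by file name (hypothetical).
gold = {
    "call_001.txt": "billing",
    "call_002.txt": "cancel_service",
    "call_003.txt": "billing",
    "call_004.txt": "tech_support",
}
# Labels produced by the categorization engine for the same files (hypothetical).
engine = {
    "call_001.txt": "billing",
    "call_002.txt": "billing",
    "call_003.txt": "billing",
    "call_004.txt": "tech_support",
}

def score(category):
    """Precision and recall for one category, scored by file name."""
    tp = sum(1 for f, c in engine.items() if c == category and gold[f] == category)
    predicted = sum(1 for c in engine.values() if c == category)
    actual = sum(1 for c in gold.values() if c == category)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

for cat in sorted(set(gold.values())):
    p, r = score(cat)
    print(f"{cat}: precision={p:.2f} recall={r:.2f}")
```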

25 Text Analytics Evaluation: Case Study – Phase II POC: Risks
• CIO/CTO problem – this is not a regular software process
• Language is messy, not just complex
  – 30% accuracy isn't 30% done – it could be 90% done
• Variability of human categorization / expression
  – Even professional writers – journalist examples
• Categorization is iterative, not "the program works"
  – Need a realistic budget and a flexible project plan
• "Anyone can do categorization"
  – Librarians often overdo it, SMEs often get lost (keywords)
• Meta-language issues – understanding the results
  – Need to educate IT and business in their own language

26 Text Analytics POC Outcomes: Categorization Results
                            SAS     IBM
  Recall – Motivation       92.6    90.7
  Recall – Actions          93.8    88.3
  Precision – Motivation    84.3
  Precision – Actions       100
  Uncategorized             87.5
  Raw Precision             73      46

27 Text Analytics POC Outcomes: Vendor Comparisons
• Categorization results – both good, edge to SAS on precision
  – Use of relevancy to set thresholds
• Development environment – IBM as a toolkit provides more flexibility, but it also increases development effort
• Methodology – IBM enforces a good method, but takes more time
  – SAS can be used in exactly the same way
• SAS has a much more complete set of operators – NOT, DIST, START
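
A minimal sketch of "using relevancy to set thresholds": assume the engine returns a relevancy score per document for a category, assign the category only above a cutoff, and tune that cutoff to trade precision against recall. Scores and labels are hypothetical; scikit-learn is used only for the metrics.

```python
from sklearn.metrics import precision_score, recall_score

scores = [0.95, 0.80, 0.72, 0.65, 0.55, 0.40, 0.35, 0.20]  # engine relevancy per document (hypothetical)
gold   = [1,    1,    1,    0,    1,    0,    0,    0]     # SME judgment per document (hypothetical)

# Sweep the cutoff: lower thresholds favor recall, higher thresholds favor precision.
for threshold in (0.3, 0.5, 0.7):
    predicted = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(gold, predicted, zero_division=0)
    r = recall_score(gold, predicted, zero_division=0)
    print(f"threshold {threshold}: precision={p:.2f} recall={r:.2f}")
```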

28 Text Analytics POC Outcomes: Vendor Comparisons – Functionality
• Sentiment analysis
  – SAS has a workbench; IBM would require more development
  – SAS also has statistical modeling capabilities
• Entity and fact extraction – seems basically the same
  – SAS can use operators for improved disambiguation
• Summarization – SAS has it built in
  – IBM could develop it using categorization rules – but not clear that would be as effective without operators
• Conclusion: both can do the job, edge to SAS
• Now the fun begins – development

29 Text Analytics Development: Foundation
• Articulated Information Management Strategy (K map)
  – Content and structures and metadata
  – Search, ECM, applications – and how they are used in the enterprise
  – Community information needs and the text analytics team
• POC establishes the preliminary foundation
  – Need to expand and deepen it
  – Content – full range, basis for rules and training
  – Additional SMEs – content selection, refinement
• Taxonomy – starting point for categorization / is it suitable?
• Databases – starting point for entity catalogs

30 Text Analytics Development: Enterprise Environment – Case Studies
• A Tale of Two Taxonomies
  – It was the best of times, it was the worst of times
• Basic approach
  – Initial meetings – project planning
  – High-level K map – content, people, technology
  – Contextual and information interviews
  – Content analysis
  – Draft taxonomy – validation interviews, refine
  – Integration and governance plans

31 Text Analytics Development: Enterprise Environment – Case One – Taxonomy, 7 Facets
• Taxonomy of Subjects / Disciplines:
  – Science > Marine Science > Marine microbiology > Marine toxins
• Facets:
  – Organization > Division > Group
  – Clients > Federal > EPA
  – Instruments > Environmental Testing > Ocean Analysis > Vehicle
  – Facilities > Division > Location > Building X
  – Methods > Social > Population Study
  – Materials > Compounds > Chemicals
  – Content Type – Knowledge Asset > Proposals
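
A minimal sketch of how a subject taxonomy plus facets like these can be represented as simple path structures, so a document record carries one subject path and one value per facet. The paths reuse the examples above; the document record itself is hypothetical.

```python
# Subject path from the taxonomy of subjects/disciplines.
taxonomy_path = ["Science", "Marine Science", "Marine microbiology", "Marine toxins"]

# Facet name -> example path within that facet (from the slide).
facets = {
    "Organization": ["Division", "Group"],
    "Clients": ["Federal", "EPA"],
    "Instruments": ["Environmental Testing", "Ocean Analysis", "Vehicle"],
    "Facilities": ["Division", "Location", "Building X"],
    "Methods": ["Social", "Population Study"],
    "Materials": ["Compounds", "Chemicals"],
    "Content Type": ["Knowledge Asset", "Proposals"],
}

# A document's metadata: one subject path plus a value per facet (hypothetical document).
document = {
    "title": "Marine toxin sampling proposal",
    "subject": " > ".join(taxonomy_path),
    "facet_values": {name: " > ".join(path) for name, path in facets.items()},
}

print(document["subject"])
for facet, value in document["facet_values"].items():
    print(f"{facet}: {value}")
```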

32 Text Analytics Development: Enterprise Environment – Case One – Taxonomy, 7 Facets
• Project owner – KM department – included RM, business process
• Involvement of the library – critical
• Realistic budget, flexible project plan
• Successful interviews – built on context
  – Overall information strategy – where the taxonomy fits
• Good draft taxonomy and extended refinement
  – Software, process, team – trained library staff
  – Good selection and number of facets
• Final plans and hand-off to the client

33 Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
• Taxonomy of Subjects / Disciplines:
  – Geology > Petrology
• Facets:
  – Organization > Division > Group
  – Process > Drill a Well > File Test Plan
  – Assets > Platforms > Platform A
  – Content Type > Communication > Presentations

34 Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
• Environment issues
  – Value of the taxonomy understood, but not its complexity and scope
  – Under-budgeted, under-staffed
  – Location – not KM – tied to RM and software
    – A solution looking for the right problem
  – Importance of an internal library staff
  – Difficulty of merging internal expertise and taxonomy expertise

35 Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
• Project issues
  – Project mindset – not infrastructure
  – Wrong kind of project management
    – Special needs of a taxonomy project
    – Importance of integration – with the team, with the company
  – Project plan treated as more important than results
    – Rushing to meet deadlines doesn't work as well with semantics as it does with software

36 Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
• Research issues
  – Not enough research – and the wrong people
  – Interference of non-taxonomy issues – communication
  – Misunderstanding of research – wanted tinker-toy connections
    – Interview 1 implies conclusion A
• Design issues
  – Not enough facets
  – Wrong set of facets – business, not information
  – Ill-defined facets – too complex an internal structure

37 Text Analytics Development: Conclusion – Risk Factors
• Political-cultural-semantic environment
  – Not simple resistance – more subtle – re-interpretation of specific conclusions and the sequence of conclusions / relative importance of specific recommendations
• Understanding project scope
• Access to content and people
  – Enthusiastic access
• Importance of a unified project team
  – Working communication as well as weekly meetings

38 Text Analytics Development: Case Study 2 – POC – Telecom Client
• Demo of SAS Enterprise Content Categorization

39 Text Analytics Development: Best Practices – Principles
• Importance of ongoing maintenance and refinement
• Need a dedicated taxonomy team working with SMEs
• Work with application developers to incorporate text analytics into new applications
• Importance of metrics and feedback – software and social
• Questions:
  – What are the important subjects (and changes)?
  – What information do they need?
  – How is their information related to other silos?

40 Text Analytics Development: Best Practices – Principles
• Process
  – Realistic budget – not a nice-to-have add-on
  – Flexible project plan – semantics are complex and messy
    – Time estimates are difficult; objective success measures are too
  – Transition from development to maintenance is fluid
• Resources
  – Interdisciplinary team is essential
  – Importance of communication – languages
  – Merging internal and external expertise

41 Text Analytics Development: Best Practices – Principles
• Categorization taxonomy structure
  – Tradeoff of depth against complexity of rules
  – Multiple avenues – facets, terms, rules, etc.; no single right balance
  – Recall-precision balance is application specific
  – Training sets as starting points; rules rule
  – Need for custom development
• Technology
  – Basic integration – XML
  – Advanced – combine unstructured and structured data in new ways
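
A minimal sketch of "basic integration – XML": serializing categorization and extraction output so a downstream system (search engine, ECM) can consume it. The element names and sample values are hypothetical, not a vendor schema.

```python
import xml.etree.ElementTree as ET

# Categorization/extraction output for one document (hypothetical).
result = {
    "document": "call_001.txt",
    "categories": ["Billing", "Cancel Service"],
    "entities": {"phone": ["415-555-0123"]},
}

# Build an XML record: <document name="..."> <categories>... <entities>...
root = ET.Element("document", name=result["document"])
cats = ET.SubElement(root, "categories")
for c in result["categories"]:
    ET.SubElement(cats, "category").text = c
ents = ET.SubElement(root, "entities")
for etype, values in result["entities"].items():
    for v in values:
        ET.SubElement(ents, "entity", type=etype).text = v

print(ET.tostring(root, encoding="unicode"))
```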

42 Text Analytics Development: Best Practices – Risk Factors
• Value understood, but not the complexity and scope
• Project mindset – software project and then done
• Not enough research on user information needs and behaviors
  – Talking to the right people and asking the right questions
  – Getting beyond "All of the Above" surveys
• Not enough resources, wrong resources
• Enthusiastic access to content and people
• Bad design – starting with the wrong type of taxonomy
• Categorization is not library science
  – More like cognitive anthropology

43 Semantic Infrastructure Development: Conclusion
• Text analytics is the foundation for semantic infrastructure
• Evaluation of text analytics – different than IT software
  – POC – essential, the foundation of development
  – Difference between taxonomy and categorization
    – Concepts vs. text in documents
• Enterprise context – strategic, self-knowledge
  – Infrastructure resource, not a project
  – Interdisciplinary team and applications
• Integration with other initiatives and technologies
  – Text mining, data mining, sentiment & beyond, everything!

44 Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group: Knowledge Architecture Professional Services
http://www.kapsgroup.com

