Presentation is loading. Please wait.

Presentation is loading. Please wait.

SemTech Text Analytics Evaluation Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services

Similar presentations


Presentation on theme: "SemTech Text Analytics Evaluation Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services"— Presentation transcript:

1 SemTech Text Analytics Evaluation Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com

2 2 Agenda  Text Analytics Features, Varieties, Vendors  Evaluation Process – Start with Self-Knowledge – Text Analytics Team – Features and Capabilities – Filter  Proof of Concept/Pilot – Themes and Issues – Case Study  Conclusion

3 3 KAPS Group: General  Knowledge Architecture Professional Services  Virtual Company: Network of consultants – 8-10  Partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc.  Consulting, Strategy, Knowledge architecture audit  Services: – Taxonomy/Text Analytics development, consulting, customization – Technology Consulting – Search, CMS, Portals, etc. – Evaluation of Enterprise Search, Text Analytics – Metadata standards and implementation – Knowledge Management: Collaboration, Expertise, e-learning – Applied Theory – Faceted taxonomies, complexity theory, natural categories

4 4 Introduction to Text Analytics Text Analytics Features  Noun Phrase Extraction (Entity, Concept, Events, etc.) – Catalogs with variants, rule based dynamic – Multiple types, custom classes – entities, concepts, events – Feeds facets  Summarization – Customizable rules, map to different content  Fact Extraction – Relationships of entities – people-organizations-activities – Ontologies – triples, RDF, etc.  Sentiment Analysis – Statistical, rules – full categorization set of operators

5 5 Introduction to Text Analytics Text Analytics Features  Auto-categorization – Training sets – Bayesian, Vector space – Terms – literal strings, stemming, dictionary of related terms – Rules – simple – position in text (Title, body, url) – Semantic Network – Predefined relationships, sets of rules – Boolean– Full search syntax – AND, OR, NOT – Advanced – NEAR (#), PARAGRAPH, SENTENCE  This is the most difficult to develop  Build on a Taxonomy  Combine with Extraction – If any of list of entities and other words

6 6

7 7

8 8

9 9

10 10

11 11

12 12 Varieties of Taxonomy/ Text Analytics Software  Taxonomy Management – Synaptica, SchemaLogic  Full Platform – SAS-Teragram, SAP-Inxight, Smart Logic, Data Harmony, Concept Searching, Expert System, IBM, GATE  Content Management – embedded  Embedded – Search – FAST, Autonomy, Endeca, Exalead, etc.  Specialty – Sentiment Analysis, VOC – Lexalytics, Attensity / Reports – Ontology – extraction, plus ontology

13 Evaluating Taxonomy/Text Analytics Software Start with Self Knowledge  Strategic and Business Context  Info Problems – what, how severe  Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it  Formal Process - KA audit – content, users, technology, business and information behaviors, applications - Or informal for smaller organization, application specific initiatives  Text Analytics Strategy/Model – forms, technology, people – Existing taxonomic resources, software  Need this foundation to evaluate and to develop 13

14 Evaluating Taxonomy/Text Analytics Software Start with Self Knowledge  Do you need it – and what blend if so?  Taxonomy Management Full Functionality – Multiple taxonomies, languages, authors-editors  Technology Environment – Text Mining, ECM, Enterprise Search – Where is it embedded, integration issues  Publishing Process – where and how is metadata being added – now and projected future – Can it utilize auto-categorization, entity extraction, summarization  Applications – text mining, BI, CI, Social Media, Mobile? 14

15 Design of the Text Analytics Selection Team  Traditional Candidates - IT  Experience with large software purchases – Search/Categorization is unlike other software  Experience with needs assessments – Need more – know what questions to ask, knowledge audit  Objective criteria – Looking where there is light?  Asking IT to select text analytics software is like asking a construction company to select the design of your house.  They have the budget – OK, they can play. 15

16 Design of the Text Analytics Selection Team  Traditional Candidates - Business Owners  Understand the business – But don’t understand information behavior  Focus on business value, not technology – Focus on semantics is needed  Asking business owners to select text analytics software is like asking a restaurant owner to do the cooking  They can get executive sponsorship, support, and budget. – OK, they can play 16

17 Design of the Text Analytics Selection Team  Traditional Candidates - Library  Understand information structure – But not how it is used in the business  Experts in search experience and categorization – Suitable for experts, not regular users  Asking librarians to select text analytics software is like asking an accountant to establish your financial strategy  Experience with variety of search engines, taxonomy software, integration issues – OK, they can play 17

18 Design of the Text Analytics Selection Team  Interdisciplinary Team, headed by Information Professionals  Relative Contributions – IT – Set necessary conditions, support tests – Business – provide input into requirements, support project – Library – provide input into requirements, add understanding of search semantics and functionality  Much more likely to make a good decision  Create the foundation for implementation 18

19 Evaluating Text Analytics Software – Process  Start with Self Knowledge  Eliminate the unfit – Filter One- Ask Experts - reputation, research – Gartner, etc. Market strength of vendor, platforms, etc. – Filter Two - Feature scorecard – minimum, must have, filter – Filter Three – Technology Filter – match to your overall scope and capabilities – Filter not a focus – Filter Four – Focus Group one day visit – 3-4 vendors  Deep pilot (2) / POC – advanced, integration, semantics  Focus on working relationship with vendor. 19

20 Initial Evaluation Example Outcomes  Filter One: – Company A, B – sentiment analysis focus, weak categorization – Company C – Lack of full suite of text analytics – Company D – business concerns, support – Open Source – license issues – Ontology Vendors – missing categorization capabilities  4 Demos – Saw a variety of different approaches, but – Company X – lacking sentiment analysis, require 2 vendors – Company Y – lack of language support, development cost 20

21 Evaluating Taxonomy Software POC - Approach  Quality of results is the essential factor  6 weeks POC – bake off / or short pilot  Real life scenarios, categorization with your content  Preparation: – Preliminary analysis of content and users information needs – Set up software in lab – relatively easy – Train taxonomist(s) on software(s) – Develop taxonomy if none available  Six week POC – 3 rounds of development, test, refine / Not OOB  Need SME’s as test evaluators – also to do an initial categorization of content 21

22 Evaluating Taxonomy Software POC – Initial Design  Majority of time is on auto-categorization  Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time  Risks – getting software installed and working, getting the right content, initial categorization of content  Elements: – Content – Search terms / search scenarios – Training sets – Test sets of content  Development Team – expert consultants plus internal taxonomists, technical 22

23 Evaluating Taxonomy Software POC – Range of Evaluations  Basic – Can this stuff work at all?  Auto-categorization to existing taxonomy – variety of content  Clustering – automatic node generation  Summarization  Entity extraction – build a number of catalogs – design which ones based on projected needs – example privacy info (SS#, phone, etc.)  Entity example –people, organization, methods, etc.  Evaluate usability in action by taxonomists  Integration – with ontologies  Output – XML, API’s 23

24 Evaluating Text Analytics Software POC - Issues  Quality of content – range of issues – spelling to size to ?  Quality of initial human categorization  Normalize among different test evaluators  Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors  Quality of taxonomy – General issues – structure (too flat or too deep) – Overlapping categories – Differences in use – browse, index, categorize  Categorization essential issue is complexity of language  Entity Extraction essential issue is scale and disambiguation 24

25 Evaluating Text Analytics Software Risks  CIO/CTO Problem –This is not a regular software process  Language is messy not just complex – 30% accuracy isn’t 30% done – could be 90%  Variability of human categorization / expression – Even professional writers – journalists examples  Categorization is iterative, not “the program works” – Need realistic budget and flexible project plan  Anyone can do categorization – Librarians often overdo, SME’s often get lost (keywords)  Meta-language issues – understanding the results – Need to educate IT and business in their language 25

26 Case Study: Telecom Service  Company History, Reputation  Full Platform –Categorization, Extraction, Sentiment  Integration – java, API-SDK, Linux  Multiple languages  Scale – millions of docs a day  Total Cost of Ownership  Ease of Development - new  Vendor Relationship – OEM, etc.  Expert Systems  IBM  SAS  Smart Logic  Option – Multiple vendors – Sentiment & Platform 26

27 POC Design Discussion: Evaluation Criteria  Basic Test Design – categorize test set – Score – by file name, human testers  Categorization – Accuracy Level – 80-90% – Effort Level per accuracy level  Sentiment Analysis – Accuracy Level – 80-90% – Effort Level per accuracy level  Quantify development time – main elements  Comparison of two vendors – how score? – Combination of scores and report 27

28 Text Analytics POC Outcomes Categorization Results SASIBM Recall-Motivation92.690.7 Recall-Actions93.888.3 Precision – Mot.84.3 Precision-Act100 Uncategorized87.5 Raw Precision7346 28

29 Text Analytics POC Outcomes Vendor Comparisons  Categorization Results – both good, edge to SAS on precision – Use of Relevancy to set thresholds  Development Environment – IBM as toolkit provides more flexibility but it also increases development effort  Methodology – IBM enforces good method, but takes more time – SAS can be used in exactly the same way  SAS has a much more complete set of operators – NOT, DIST, START 29

30 Text Analytics POC Outcomes Vendor Comparisons - Functionality  Sentiment Analysis – SAS has workbench, IBM would require more development – SAS also has statistical modeling capabilities  Entity and Fact extraction – seems basically the same – SAS and use operators for improved disambiguation –  Summarization – SAS has built-in – IBM could develop using categorization rules – but not clear that would be as effective without operators  Conclusion: Both can do the job, edge to SAS 30

31 Conclusion  Start with self-knowledge – what will you use it for? – Current Environment – technology, information  Basic Features are only filters, not scores  Integration – need an integrated team (IT, Business, KA) – For evaluation and development  POC – your content, real world scenarios – not scores  Foundation for development, experience with software – Development is better, faster, cheaper  Categorization is essential, time consuming  Next: Text Analytics + Semantic Web + Ontology – Integration of Data and Text Mining – Mutual Enrichment – smarter data, richer analytics 31

32 Questions? Tom Reamy tomr@kapsgroup.com KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com


Download ppt "SemTech Text Analytics Evaluation Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services"

Similar presentations


Ads by Google