1
Text Analytics Evaluation
A Case Study: Amdocs
Tom Reamy, Chief Knowledge Architect, KAPS Group
http://www.kapsgroup.com
Copyright © 2011, SAS Institute Inc. All rights reserved. #analytics2011
2
Text Analytics Evaluation Case Study
Agenda
  Introduction – Text Analytics Basics
  Evaluation Process & Methodology
    Two Stages – Initial Filters & POC
  Initial Evaluation Results
  Proof of Concept
    Methodology
    Results
  Final Recommendation
  Sentiment Analysis and Beyond
  Conclusions
3
KAPS Group: General
Knowledge Architecture Professional Services
Virtual Company: Network of consultants – 8-10
Partners – SAS – 2 Whitepapers (Semantic infrastructure)
  GAO, FDA, Amdocs – Sales & Development Projects
  Other Partners: Smart Logic, FAST, Concept Searching, etc.
Consulting, Strategy, Knowledge architecture audit
Services:
  Text Analytics evaluation, development, consulting, customization
  Knowledge Representation – taxonomy, ontology, Prototype
  Knowledge Management: Collaboration, Expertise, e-learning
Applied Theory – Faceted taxonomies, complexity theory, natural categories
4
Text Analytics Evaluation Case Study
Text Analytics Features
Noun Phrase Extraction
  Catalogs with variants, rule based dynamic
  Multiple types, custom classes – entities, concepts, events
  Feeds facets
Summarization
  Customizable rules, map to different content
Fact Extraction
  Relationships of entities – people-organizations-activities
  Ontologies – triples, RDF, etc.
Sentiment Analysis
  Rules – Objects and phrases
5
Introduction to Text Analytics
Text Analytics Features
Auto-categorization
  Training sets – Bayesian, Vector space
  Terms – literal strings, stemming, dictionary of related terms
  Rules – simple – position in text (Title, body, url)
  Semantic Network – Predefined relationships, sets of rules
  Boolean – Full search syntax – AND, OR, NOT
  Advanced – DIST (#), PARAGRAPH, SENTENCE
This is the most difficult to develop
Build on a Taxonomy
Combine with Extraction
  If any of list of entities and other words
12
Evaluating Text Analytics Software
Start with Self Knowledge
Strategic and Business Context
  Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it
  Info Problems – what, how severe
  Formal Process - KA audit – content, users, technology, business and information behaviors, applications
    Or informal for smaller organization, application specific initiatives
  Text Analytics Strategy/Model – forms, technology, people
  Existing taxonomic resources, software
Need this foundation to evaluate and to develop
13
Evaluating Text Analytics Software
Start with Self Knowledge
Do you need it – and what blend if so?
Taxonomy Management
  Full Functionality
  Multiple taxonomies, languages, authors-editors
Technology Environment – Text Mining, ECM, Enterprise Search
  Where is it embedded, integration issues
Publishing Process – where and how is metadata being added – now and projected future
  Can it utilize auto-categorization, entity extraction, summarization
Applications – text mining, BI, CI, Social Media, Mobile?
14
Evaluation Process & Methodology
Team - Interdisciplinary
IT – Large software purchase, needs assessment
  » Text Analytics is different – semantics
  » Construction company designing your house
Business – Understand the business needs
  » Don’t understand information
  » Restaurant owner doing the cooking
Library - know information, search
  » Don’t understand the business, non-information experts
  » Accountant doing financial strategy
Team
  3 KAPS - Information
  5-8 Amdocs – SME - business, Technical
15
Evaluation Process & Methodology
Amdocs Requirements / Initial Filters
Platform – range of capabilities
  Categorization, Sentiment analysis, etc.
Technical
  APIs, Java based, Linux run time
  Scalability – millions of documents a day
  Import-Export – XML, RDF
Total Cost of Ownership
Vendor Relationship - OEM
Usability, Multiple Language Support
16
Evaluation Process & Methodology
Two Phases
Phase I – Traditional Software Evaluation
  Filter One - Ask Experts - reputation, research – Gartner, etc.
    » Market strength of vendor, platforms, etc.
  Filter Two - Feature scorecard – minimum, must have, filter to top 3
  Filter Three – Technology Filter – match to your overall scope and capabilities – Filter not a focus
  Filter Four – In-Depth Demo – 3-6 vendors
Phase II - Deep POC (2) – advanced, integration, semantics
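The "must have, filter to top 3" step of Filter Two can be sketched in a few lines. This is only an illustration: the vendor names, feature labels, and the tie-breaking by name are all assumptions, not the scorecard used in the Amdocs evaluation.

```python
from typing import Dict, List, Set

# Hypothetical must-have features, loosely modelled on the requirements slide.
MUST_HAVE: Set[str] = {"categorization", "java_api", "linux_runtime"}


def shortlist(vendors: Dict[str, Set[str]],
              nice_to_have: Set[str],
              top_n: int = 3) -> List[str]:
    """Drop vendors missing any must-have, then rank the rest."""
    # Filter: a vendor qualifies only if it offers every must-have feature.
    qualified = {name: feats for name, feats in vendors.items()
                 if MUST_HAVE <= feats}
    # Rank: more nice-to-have features first; ties broken alphabetically.
    ranked = sorted(qualified,
                    key=lambda name: (-len(qualified[name] & nice_to_have), name))
    return ranked[:top_n]


# Made-up vendors for illustration only.
vendors = {
    "Vendor A": {"categorization", "java_api", "linux_runtime", "sentiment", "rdf_export"},
    "Vendor B": {"categorization", "java_api", "linux_runtime", "sentiment"},
    "Vendor C": {"categorization", "sentiment", "rdf_export"},
    "Vendor D": {"categorization", "java_api", "linux_runtime"},
    "Vendor E": {"categorization", "java_api", "linux_runtime", "sentiment",
                 "rdf_export", "multilingual"},
}
top3 = shortlist(vendors, nice_to_have={"sentiment", "rdf_export", "multilingual"})
```

The must-have check acts as a hard filter before any ranking, so a vendor with many extras but a missing requirement (Vendor C here) never reaches the shortlist.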
17
Phase I – Case Study
  Attensity
  SAP – Inxight
  Clarabridge
  ClearForest
  Concept Searching
  Data Harmony / Access Innovations
  Expert Systems
  GATE (Open Source)
  IBM
  Lexalytics
  Multi-Tes
  Nstein
  SAS
  SchemaLogic
  Smart Logic
Content Management Enterprise Search Sentiment Analysis Specialty Ontology Platforms
18
Phase I - 4 Demos
SmartLogic
  Taxonomy Management, good interface
  20 types of entities, APIs, XML-Http
  Full Platform – no Sentiment Analysis
Expert Systems
  Different Approach – Semantic Network – 400,000 words / 3,500 rules, 65 types of relationships
  Strong out of the box – 80%, no training sets
  Language concerns – no Spanish, high cost to develop new ones
  Customization – add terms and relationships, develop rules – uncertain how much effort, use their professional linguists
19
Phase I - 4 Demos
SAS - Content Categorization & Sentiment
  Full Platform – categorization, entity, sentiment – integrated
  APIs, XML, Java – ease of integration
  Strong history of company, range of experience
IBM – Classification, Content Analytics – Two products
  Classification Module – statistical emphasis
    » Once trained, it could “learn” new words
    » Rapid development / depends on training sets
  Content Analytics, Languageware Workbench
    » Full Platform
20
Phase I – Findings
SAS & IBM – Full Platform, OEM Experience, multilingual
  Proven ability to scale, customizable components, mature tool sets
SAS was the strongest offering
  Capabilities, experience, integrated tool sets
IBM good second choice
  Capabilities, experience - multiple products – strength and weakness
Single Vendor POC - Demonstrate it can be done
  Ability to dive more deeply into capabilities, issues
  Stronger foundation for future development, Learn the software better
  Danger of missing better choice
Two Vendor POC
  Balance of depth and full testing
21
Phase II - Proof Of Concept - POC
4-6 weeks POC – bake off / or short pilot
Measurable Quality of results is the essential factor
Real life scenarios, categorization with your content
2-3 rounds of development, test, refine / Not OOB
Need SMEs as test evaluators – also to do an initial categorization of content
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time
Taxonomy Developers – expert consultants plus internal taxonomists
22
Phase II – POC: Range of Evaluations
Basic Question – Can this stuff work at all?
Auto-categorization to existing taxonomy – variety of content
  Essential Issue is complexity of language
Clustering – automatic node generation
Summarization
Entity extraction – build a number of catalogs – design which ones based on projected needs – example privacy info (SS#, phone, etc.)
  Entity example – people, organization, methods, etc.
  Essential issue is scale and disambiguation
Evaluate usability in action by taxonomists
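The privacy-info catalog mentioned above (SS#, phone) is the simplest kind of entity extraction to illustrate. The sketch below uses plain regular expressions and assumes US-style formats; the patterns and the sample note are hypothetical, and a real catalog would need many more variants plus disambiguation.

```python
import re
from typing import Dict, List

# Assumed formats: SSN as 123-45-6789, phone as 555-867-5309 or 555.867.5309.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")


def extract_private(text: str) -> Dict[str, List[str]]:
    """Return every SSN-like and phone-like string found in the text."""
    return {
        "ssn": SSN.findall(text),
        "phone": PHONE.findall(text),
    }


# Made-up note for illustration only.
note = "Customer SSN 123-45-6789, call back at 555-867-5309."
found = extract_private(note)
```

Even this toy case hints at the disambiguation issue: both entity types are runs of digits and separators, and only the grouping tells them apart.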
23
Phase II – POC: Evaluation Criteria & Issues
Basic Test Design – categorize test set
  Score – by file name, human testers
Categorization & Sentiment – Accuracy 80-90%
  Effort Level per accuracy level
Quantify development time – main elements
Comparison of two vendors – how score?
  Combination of scores and report
Quality of content & initial human categorization
  Normalize among different test evaluators
Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors
Quality of taxonomy – structure, overlapping categories
24
Phase II – POC: Risks
CIO/CTO Problem – This is not a regular software process
Language is messy not just complex
  30% accuracy isn’t 30% done – could be 90%
Variability of human categorization / expression
  Even professional writers – journalists examples
Categorization is iterative, not “the program works”
  Need realistic budget and flexible project plan
Anyone can do categorization
  Librarians often overdo, SMEs often get lost (keywords)
Meta-language issues – understanding the results
  Need to educate IT and business in their language
25
Text Analytics POC Outcomes
Categorization of CSR Notes
Content – 2,000 CSR notes categorized by humans
  Variation among human categorization
Recall (finding all the correct documents)
Precision (not categorizing documents from other categories)
  Precision is harder than recall
Two scores – raw and corrected – only raw for IBM precision
  First score was very low, with an extra round got it up
Uncategorized documents – 50,000 – look at top 10 in each category
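Recall and precision as defined on this slide can be computed per category from two sets of document IDs: the ones the human evaluators assigned and the ones the rules assigned. The sketch below is a minimal illustration; the note IDs are made up and are not POC data.

```python
from typing import Dict, Set


def precision_recall(predicted: Set[str], actual: Set[str]) -> Dict[str, float]:
    """Score one category given predicted and human-assigned document IDs."""
    true_pos = len(predicted & actual)
    # Precision: share of predicted documents that really belong to the category.
    precision = true_pos / len(predicted) if predicted else 0.0
    # Recall: share of the category's documents that the rules found.
    recall = true_pos / len(actual) if actual else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical CSR note IDs for a single category.
human = {"n1", "n2", "n3", "n4"}
rules = {"n1", "n2", "n3", "n7", "n8"}
scores = precision_recall(rules, human)
```

Here the rules find three of the four correct notes (recall 0.75) but also pull in two notes from other categories (precision 0.6).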
26
Text Analytics POC Outcomes
Categorization Results

                      SAS     IBM
Recall-Motivation     92.6    90.7
Recall-Actions        93.8    88.3
Precision – Mot.      84.3
Precision-Act         100
Uncategorized         87.5
Raw Precision         73      46
27
Text Analytics POC Outcomes
Vendor Comparisons
SAS has a much more complete set of operators – NOT, DIST, ORDDIST, START
  IBM team was able to develop workarounds for some – more development effort
  Operators impact most other features – Sentiment analysis, Entity and Fact Extraction, Summarization, etc.
SAS has relevancy – can be used for precision, applications
Sentiment Analysis – SAS has workbench, IBM would require more development
  SAS also has statistical modeling capabilities
Development Environment & Methodology
  IBM as toolkit provides more flexibility but it also increases development effort, enforces good method
28
Text Analytics POC Outcomes
Vendor Comparisons - Conclusions
Both can do the job
Product vs. Tool Kit (SAS has toolkit capabilities also)
IBM will require more development effort
  Boolean Operators – NOT, DIST, ORDDIST, START, etc.
    » In rules, entity and fact extraction
  Sentiment Analysis – rules, statistical
  Summarization
Rule building more programming than taxonomy
IBM harder to learn – POC had 2X effort for IBM
Conclusion: Buy SAS ECC and Sentiment Workbench
29
Sentiment Analysis Development Process
Combination of Statistical and categorization rules
Start with Training sets – examples of positive, negative, neutral documents
Develop a Statistical Model
Generate domain positive and negative words and phrases
Develop a taxonomy of Products & Features
Develop rules for positive and negative statements
Test and Refine
Test and Refine again
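The rule side of this process (domain word lists plus a taxonomy of products and features) can be sketched as a tiny per-feature scorer. Everything below is hypothetical: the word lists, the feature taxonomy, and the sentence-level scoring are stand-ins, and the statistical model step is not shown.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Hypothetical domain word lists and feature taxonomy for a telecom setting.
POSITIVE: Set[str] = {"great", "love", "fast", "reliable"}
NEGATIVE: Set[str] = {"slow", "dropped", "expensive", "terrible"}
FEATURES: Dict[str, Set[str]] = {
    "coverage": {"coverage", "signal"},
    "billing": {"bill", "billing", "charge"},
}


def feature_sentiment(text: str) -> Dict[str, int]:
    """Sum positive minus negative word counts per feature, sentence by sentence."""
    scores: Dict[str, int] = defaultdict(int)
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = set(re.findall(r"[a-z]+", sentence))
        polarity = len(tokens & POSITIVE) - len(tokens & NEGATIVE)
        # Attribute the sentence's polarity to every feature it mentions.
        for feature, terms in FEATURES.items():
            if tokens & terms:
                scores[feature] += polarity
    return dict(scores)


result = feature_sentiment("The coverage is great. The bill is expensive and terrible.")
```

Each "test and refine" round in the list above would then amount to comparing scores like these against the training sets and adjusting the word lists and rules.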
35
Beyond Sentiment: Behavior Prediction
Case Study – Telecom Customer Service
Problem – distinguish customers likely to cancel from mere threats
Analyze customer support notes
General issues – creative spelling, second hand reports
Develop categorization rules
  First – distinguish cancellation calls – not simple
  Second - distinguish cancel what – one line or all
  Third – distinguish real threats
36
Beyond Sentiment
Behavior Prediction – Case Study
Basic Rule
  (START_20, (AND, (DIST_7,"[cancel]", "[cancel-what-cust]"), (NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
Examples:
  customer called to say he will cancell his account if the does not stop receiving a call from the ad agency.
  cci and is upset that he has the asl charge and wants it off or her is going to cancel his act
  ask about the contract expiration date as she wanted to cxl teh acct
Combine sophisticated rules with sentiment statistical training and Predictive Analytics
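To make the shape of the Basic Rule concrete, here is a rough Python approximation. It is a sketch under stated assumptions, not the SAS rule engine: START_20 is read as "look only at the first 20 tokens", DIST_n as "two concepts occur within n tokens of each other", and the concept word lists are invented for illustration.

```python
import re
from typing import Dict, List, Set

# Hypothetical concept lists; the real ones would be built from the CSR notes.
CONCEPTS: Dict[str, Set[str]] = {
    "cancel": {"cancel", "cancell", "cxl", "cancellation"},
    "cancel-what-cust": {"account", "acct", "act", "service"},
    "one-line": {"line"},
    "restore": {"restore", "reinstate"},
    "if": {"if"},
}


def positions(tokens: List[str], concept: str) -> List[int]:
    """Token indexes at which any term of the concept occurs."""
    return [i for i, tok in enumerate(tokens) if tok in CONCEPTS[concept]]


def within(a: List[int], b: List[int], dist: int) -> bool:
    """Assumed DIST_n: some pair of positions is at most dist tokens apart."""
    return any(abs(i - j) <= dist for i in a for j in b)


def likely_cancel(note: str, start: int = 20) -> bool:
    # Assumed START_20: evaluate the rule on the first 20 tokens only.
    tokens = re.findall(r"[a-z']+", note.lower())[:start]
    cancel = positions(tokens, "cancel")
    target = positions(tokens, "cancel-what-cust")
    # OR of the three blocking concepts.
    blockers = [p for c in ("one-line", "restore", "if") for p in positions(tokens, c)]
    # AND of DIST_7(cancel, target) and NOT DIST_10(cancel, blockers).
    return within(cancel, target, 7) and not within(cancel, blockers, 10)


conditional = ("customer called to say he will cancell his account if the does not "
               "stop receiving a call from the ad agency.")
direct = "ask about the contract expiration date as she wanted to cxl teh acct"
```

Under these assumed lists, the conditional note is rejected because "if" falls within ten tokens of "cancell", while the direct note is accepted. The variant spellings ("cancell", "cxl", "acct") have to be listed explicitly, which is the "creative spelling" problem in practice.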
37
Beyond Sentiment - Wisdom of Crowds
Crowd Sourcing Technical Support
Example – Android User Forum
Develop a taxonomy of products, features, problem areas
Develop Categorization Rules:
  “I use the SDK method and it isn't to bad a all. I'll get some pics up later, I am still trying to get the time to update from fresh 1.0 to 1.1.”
  Find product & feature – forum structure
  Find problem areas in response, nearby text for solution
Automatic – simply expose lists of “solutions”
  Search Based application
Human mediated – experts scan and clean up solutions
38
Beyond Sentiment: Expertise Analysis
Apply Sentiment Analysis techniques to Expertise
Expertise Characterization for individuals, communities, documents, and sets of documents
  Experts prefer lower, subordinate levels
  Novices prefer higher, superordinate levels
  General Populace prefers basic level
Experts’ language structure is different
  Focus on procedures over content
Develop expertise rules – sentiment and categorization
39
Expertise Analysis
Expertise – application areas
Taxonomy / Ontology development / design – audience focus
  Card sorting – non-experts use superficial similarities
Business & Customer intelligence – add expertise to sentiment
  Deeper research into communities, customers
Text Mining - Expertise characterization of writer, corpus
eCommerce – Organization/Presentation of information – expert, novice
Expertise location - Generate automatic expertise characterization based on documents
Experiments - Pronoun Analysis – personality types
  Essay Evaluation Software - Apply to expertise characterization
    » Model levels of chunking, procedure words over content
40
Text Analytics Evaluation
Conclusions
Start with Self Knowledge – text analytics not an end in itself
Initial Evaluation – filters, not scorecards
  Weights change output – need self knowledge for good weights
Proof of Concept – essential
  OOB doesn’t tell you how it will work in real world
  Content and Scenarios is your real world
  Good idea even if you know SAS is the answer
Importance of operators, relevance for a platform
  Sentiment needs full platform capabilities
Everyone has room for improvement
41
Text Analytics
Future Directions
Start with the 80% of significant content that is not data
  Enterprise search, content management, Search based applications
Text Analytics and Text Mining
  Text Analytics turns text into data – Build better TM Apps
  Better extraction and add Subject / Concepts
  Sentiment and Beyond – Behavior, Expertise
Text Mining and Text Analytics
  TM enriching TA
  Taxonomy development
  New Content Structures, ensemble models
Text Analytics and Predictive Analytics
  More content, New content – social, interactive – CSR
  New sources of content/data = new & better apps
Add Learning & Cognitive Science and the future is ?
42
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com