Copyright © 2011, SAS Institute Inc. All rights reserved. #analytics2011 Text Analytics Evaluation A Case Study: Amdocs Tom Reamy Chief Knowledge Architect.

Presentation transcript:

Text Analytics Evaluation
A Case Study: Amdocs
Tom Reamy, Chief Knowledge Architect, KAPS Group

Text Analytics Evaluation Case Study: Agenda
- Introduction – Text Analytics Basics
- Evaluation Process & Methodology
- Two Stages – Initial Filters & POC
- Initial Evaluation Results
- Proof of Concept
- Methodology
- Results
- Final Recommendation
- Sentiment Analysis and Beyond
- Conclusions

KAPS Group: General
- Knowledge Architecture Professional Services
- Virtual Company: Network of consultants – 8-10
- Partners – SAS – 2 Whitepapers (Semantic infrastructure)
- GAO, FDA, Amdocs – Sales & Development Projects
- Other Partners: Smart Logic, FAST, Concept Searching, etc.
- Consulting, Strategy, Knowledge architecture audit
- Services:
  - Text Analytics evaluation, development, consulting, customization
  - Knowledge Representation – taxonomy, ontology, Prototype
  - Knowledge Management: Collaboration, Expertise, e-learning
- Applied Theory – Faceted taxonomies, complexity theory, natural categories

Text Analytics Evaluation Case Study: Text Analytics Features
- Noun Phrase Extraction
- Catalogs with variants, rule based dynamic
- Multiple types, custom classes – entities, concepts, events
- Feeds facets
- Summarization
- Customizable rules, map to different content
- Fact Extraction
- Relationships of entities – people-organizations-activities
- Ontologies – triples, RDF, etc.
- Sentiment Analysis
- Rules – Objects and phrases

Introduction to Text Analytics: Text Analytics Features
- Auto-categorization
- Training sets – Bayesian, Vector space
- Terms – literal strings, stemming, dictionary of related terms
- Rules – simple – position in text (Title, body, url)
- Semantic Network – Predefined relationships, sets of rules
- Boolean – Full search syntax – AND, OR, NOT
- Advanced – DIST (#), PARAGRAPH, SENTENCE
- This is the most difficult to develop
- Build on a Taxonomy
- Combine with Extraction
- If any of list of entities and other words

Copyright © 2011, SAS Institute Inc. All rights reserved. #analytics2011 Evaluating Text Analytics Software Start with Self Knowledge  Strategic and Business Context  Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it  Info Problems – what, how severe  Formal Process - KA audit – content, users, technology, business and information behaviors, applications - Or informal for smaller organization, application specific initiatives  Text Analytics Strategy/Model – forms, technology, people  Existing taxonomic resources, software  Need this foundation to evaluate and to develop 12

Copyright © 2011, SAS Institute Inc. All rights reserved. #analytics2011 Evaluating Text Analytics Software Start with Self Knowledge  Do you need it – and what blend if so?  Taxonomy Management Full Functionality  Multiple taxonomies, languages, authors-editors  Technology Environment – Text Mining, ECM, Enterprise Search  Where is it embedded, integration issues  Publishing Process – where and how is metadata being added – now and projected future  Can it utilize auto-categorization, entity extraction, summarization  Applications – text mining, BI, CI, Social Media, Mobile? 13

Evaluation Process & Methodology: Team – Interdisciplinary
- IT – Large software purchase, needs assessment
  - Text Analytics is different – semantics
  - Construction company designing your house
- Business – Understand the business needs
  - Don’t understand information
  - Restaurant owner doing the cooking
- Library – know information, search
  - Don’t understand the business, non-information experts
  - Accountant doing financial strategy
- Team – 3 KAPS – Information
- 5-8 Amdocs – SME – business, Technical

Evaluation Process & Methodology: Amdocs Requirements / Initial Filters
- Platform – range of capabilities
- Categorization, Sentiment analysis, etc.
- Technical
- APIs, Java based, Linux run time
- Scalability – millions of documents a day
- Import-Export – XML, RDF
- Total Cost of Ownership
- Vendor Relationship – OEM
- Usability, Multiple Language Support

Evaluation Process & Methodology: Two Phases
- Phase I – Traditional Software Evaluation
- Filter One – Ask Experts – reputation, research – Gartner, etc.
  - Market strength of vendor, platforms, etc.
- Filter Two – Feature scorecard – minimum, must have, filter to top 3
- Filter Three – Technology Filter – match to your overall scope and capabilities – Filter not a focus
- Filter Four – In-Depth Demo – 3-6 vendors
- Phase II – Deep POC (2) – advanced, integration, semantics

Phase I – Case Study
- Attensity
- SAP – Inxight
- Clarabridge
- ClearForest
- Concept Searching
- Data Harmony / Access Innovations
- Expert Systems
- GATE (Open Source)
- IBM
- Lexalytics
- Multi-Tes
- Nstein
- SAS
- SchemaLogic
- Smart Logic
- Content Management
- Enterprise Search
- Sentiment Analysis Specialty
- Ontology Platforms

Copyright © 2011, SAS Institute Inc. All rights reserved. #analytics Phase I - 4 Demos  SmartLogic  Taxonomy Management, good interface  20 types of entities, API’s, XML-Http  Full Platform – no Sentiment Analysis  Expert Systems  Different Approach – Semantic Network – 400,000 words / 3,500 rules, 65 types of relationships  Strong out of the box – 80%, no training sets  Language concerns – no Spanish, high cost to develop new ones  Customization – add terms and relationships, develop rules – uncertain how much effort, use their professional linguists

Copyright © 2011, SAS Institute Inc. All rights reserved. #analytics Phase I - 4 Demos  SAS- Content Categorization & Sentiment  Full Platform – categorization, entity, sentiment – integrated  API’s, XML, Java – ease of integration  Strong history of company, range of experience  IBM – Classification, Concept Analytics – Two products  Classification Module – statistical emphasis »Once trained, it could “learn” new words »Rapid development / depends on training sets  Content Analytics, Languageware Workbench »Full Platform

Phase I – Findings
- SAS & IBM – Full Platform, OEM Experience, multilingual
- Proven ability to scale, customizable components, mature tool sets
- SAS was the strongest offering
- Capabilities, experience, integrated tool sets
- IBM good second choice
- Capabilities, experience – multiple products – strength and weakness
- Single Vendor POC – Demonstrate it can be done
- Ability to dive more deeply into capabilities, issues
- Stronger foundation for future development, Learn the software better
- Danger of missing better choice
- Two Vendor POC
- Balance of depth and full testing

Phase II – Proof Of Concept (POC)
- 4-6 weeks POC – bake off / or short pilot
- Measurable Quality of results is the essential factor
- Real life scenarios, categorization with your content
- 2-3 rounds of development, test, refine / Not OOB
- Need SMEs as test evaluators – also to do an initial categorization of content
- Majority of time is on auto-categorization
- Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time
- Taxonomy Developers – expert consultants plus internal taxonomists

Phase II – POC: Range of Evaluations
- Basic Question – Can this stuff work at all?
- Auto-categorization to existing taxonomy – variety of content
- Essential Issue is complexity of language
- Clustering – automatic node generation
- Summarization
- Entity extraction – build a number of catalogs – design which ones based on projected needs – example privacy info (SS#, phone, etc.)
- Entity example – people, organization, methods, etc.
- Essential issue is scale and disambiguation
- Evaluate usability in action by taxonomists
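
The privacy-info catalog mentioned above (SS#, phone) can be illustrated with a minimal Python sketch. This is not the vendors' extraction engine: the catalog name `PRIVACY_CATALOG`, the function `extract_entities`, and the US-style number formats are all assumptions made for illustration.

```python
import re

# Hypothetical catalog: entity type -> pattern.
# Formats are assumed to be US-style; a real catalog would need many variants.
PRIVACY_CATALOG = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def extract_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = []
    for entity_type, pattern in PRIVACY_CATALOG.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), entity_type, match.group()))
    # Sort by position in the text, then drop the offset.
    return [(etype, value) for _, etype, value in sorted(hits)]


note = "cust ssn 123-45-6789, call back at 555-123-4567"
entities = extract_entities(note)
```

Even this toy version hints at the disambiguation issue the slide raises: any nine- or ten-digit number in a matching layout is flagged, whether or not it is really an SS# or a phone number.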

Phase II – POC: Evaluation Criteria & Issues
- Basic Test Design – categorize test set
- Score – by file name, human testers
- Categorization & Sentiment – Accuracy 80-90%
- Effort Level per accuracy level
- Quantify development time – main elements
- Comparison of two vendors – how score?
- Combination of scores and report
- Quality of content & initial human categorization
- Normalize among different test evaluators
- Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors
- Quality of taxonomy – structure, overlapping categories

Phase II – POC: Risks
- CIO/CTO Problem – This is not a regular software process
- Language is messy not just complex
- 30% accuracy isn’t 30% done – could be 90%
- Variability of human categorization / expression
- Even professional writers – journalists examples
- Categorization is iterative, not “the program works”
- Need realistic budget and flexible project plan
- Anyone can do categorization
- Librarians often overdo, SMEs often get lost (keywords)
- Meta-language issues – understanding the results
- Need to educate IT and business in their language

Text Analytics POC Outcomes: Categorization of CSR Notes
- Content – 2,000 CSR notes categorized by humans
- Variation among human categorization
- Recall (finding all the correct documents)
- Precision (not categorizing documents from other categories)
- Precision is harder than recall
- Two scores – raw and corrected – only raw for IBM precision
- First score was very low, with an extra round got it up
- Uncategorized documents – 50,000 – look at top 10 in each category
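
The recall and precision scores described above can be computed per category by comparing the tool's output with the human categorization. The following Python sketch shows the standard definitions only; the function name, the data layout (note id mapped to a set of categories), and the sample notes are hypothetical and are not the POC data.

```python
def precision_recall(human, machine, category):
    """Score one category against the human categorization.

    human, machine: dict mapping note id -> set of category names.
    Returns (precision, recall) as floats in [0, 1].
    """
    assigned = {doc for doc, cats in machine.items() if category in cats}
    relevant = {doc for doc, cats in human.items() if category in cats}
    correct = len(assigned & relevant)
    # Precision: share of the tool's assignments that the humans agree with.
    precision = correct / len(assigned) if assigned else 0.0
    # Recall: share of the human-labeled notes that the tool found.
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample: four CSR notes, two categories.
human = {"n1": {"cancel"}, "n2": {"cancel"}, "n3": {"billing"}, "n4": {"cancel"}}
machine = {"n1": {"cancel"}, "n2": {"cancel"}, "n3": {"cancel"}, "n4": {"cancel"}}
precision, recall = precision_recall(human, machine, "cancel")
```

In this sample the tool finds every "cancel" note (recall 1.0) but also tags a billing note (precision 0.75), which is the pattern behind "precision is harder than recall".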

Text Analytics POC Outcomes: Categorization Results
Columns: SAS, IBM
- Recall-Motivation
- Recall-Actions
- Precision-Mot.: 84.3
- Precision-Act: 100
- Uncategorized: 87.5
- Raw Precision

Copyright © 2011, SAS Institute Inc. All rights reserved. #analytics2011 Text Analytics POC Outcomes Vendor Comparisons  SAS has a much more complete set of operators – NOT, DIST, ORDDIST, START  IBM team was able to develop work arounds for some – more development effort  Operators impact most other features – Sentiment analysis, Entity and Fact Extraction, Summarization, etc.  SAS has relevancy – can be used for precision, applications  Sentiment Analysis – SAS has workbench, IBM would require more development  SAS also has statistical modeling capabilities  Development Environment & Methodology  IBM as toolkit provides more flexibility but it also increases development effort, enforces good method 27

Text Analytics POC Outcomes: Vendor Comparisons – Conclusions
- Both can do the job
- Product vs. Tool Kit (SAS has toolkit capabilities also)
- IBM will require more development effort
- Boolean Operators – NOT, DIST, ORDDIST, START, etc.
  - In rules, entity and fact extraction
- Sentiment Analysis – rules, statistical
- Summarization
- Rule building more programming than taxonomy
- IBM harder to learn – POC had 2X effort for IBM
- Conclusion: Buy SAS ECC and Sentiment Workbench

Sentiment Analysis: Development Process
- Combination of Statistical and categorization rules
- Start with Training sets – examples of positive, negative, neutral documents
- Develop a Statistical Model
- Generate domain positive and negative words and phrases
- Develop a taxonomy of Products & Features
- Develop rules for positive and negative statements
- Test and Refine
- Test and Refine again
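
One way to picture the "generate domain positive and negative words" step is to rank words by how much more often they appear in positive than in negative training documents. The sketch below uses a smoothed log-odds score, which is a stand-in chosen for illustration and not necessarily the statistical model used in the SAS workbench; the function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(doc):
    """Lowercase a document and split it into simple word tokens."""
    return re.findall(r"[a-z']+", doc.lower())


def polarity_terms(positive_docs, negative_docs, top_n=3):
    """Return (positive_terms, negative_terms), strongest first.

    Words are scored by the add-one smoothed log-odds of appearing in the
    positive training set versus the negative one.
    """
    pos = Counter(w for d in positive_docs for w in tokenize(d))
    neg = Counter(w for d in negative_docs for w in tokenize(d))
    vocab = set(pos) | set(neg)
    n_pos, n_neg, v = sum(pos.values()), sum(neg.values()), len(vocab)
    score = {
        w: math.log((pos[w] + 1) / (n_pos + v)) - math.log((neg[w] + 1) / (n_neg + v))
        for w in vocab
    }
    # Sort by score, breaking ties alphabetically so results are stable.
    ranked = sorted(vocab, key=lambda w: (score[w], w))
    return ranked[-top_n:][::-1], ranked[:top_n]


# Hypothetical training examples.
positive_docs = ["great service great coverage", "great price"]
negative_docs = ["dropped calls terrible service", "terrible billing"]
positive_terms, negative_terms = polarity_terms(positive_docs, negative_docs)
```

The candidate terms would then be reviewed by a person and folded into the positive and negative statement rules, followed by the test-and-refine rounds listed above.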

Beyond Sentiment: Behavior Prediction
Case Study – Telecom Customer Service
- Problem – distinguish customers likely to cancel from mere threats
- Analyze customer support notes
- General issues – creative spelling, second hand reports
- Develop categorization rules
- First – distinguish cancellation calls – not simple
- Second – distinguish cancel what – one line or all
- Third – distinguish real threats

Beyond Sentiment: Behavior Prediction – Case Study
- Basic Rule
    (START_20, (AND,
      (DIST_7, "[cancel]", "[cancel-what-cust]"),
      (NOT, (DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
- Examples:
  - customer called to say he will cancell his account if the does not stop receiving a call from the ad agency.
  - cci and is upset that he has the asl charge and wants it off or her is going to cancel his act
  - ask about the contract expiration date as she wanted to cxl teh acct
- Combine sophisticated rules with sentiment statistical training and Predictive Analytics
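
To make the logic of the basic rule concrete, here is a rough Python emulation. It is a sketch under stated assumptions, not the SAS rule engine: START_20 is read as "look only at the first 20 words", DIST_n as "two terms within n words of each other", and the term lists standing in for the bracketed concepts ([cancel], [cancel-what-cust], [one-line], [restore], [if]) are invented here because the slide does not define them.

```python
import re

# Hypothetical term lists standing in for the bracketed concepts.
CANCEL = {"cancel", "cancell", "cxl"}          # [cancel], incl. creative spellings
CANCEL_WHAT = {"account", "acct", "act"}        # [cancel-what-cust]
EXCLUDE = {"line", "restore", "if"}             # [one-line], [restore], [if]


def positions(tokens, terms):
    """Indexes of tokens that belong to the given term list."""
    return [i for i, tok in enumerate(tokens) if tok in terms]


def within(tokens, terms_a, terms_b, max_dist):
    """Assumed DIST_n: some term from each list lies within max_dist words."""
    return any(
        abs(i - j) <= max_dist
        for i in positions(tokens, terms_a)
        for j in positions(tokens, terms_b)
    )


def likely_cancel(note, window=20):
    """Emulate the rule: cancel near its object, and no nearby qualifier."""
    # Assumed START_20: only the first `window` words are considered.
    tokens = re.findall(r"[a-z']+", note.lower())[:window]
    return within(tokens, CANCEL, CANCEL_WHAT, 7) and not within(
        tokens, CANCEL, EXCLUDE, 10
    )


flag = likely_cancel("ask about the contract expiration date as she wanted to cxl teh acct")
```

Under these assumptions the "cxl teh acct" note is flagged, while the first example note is not, because "if" appears within ten words of "cancell" and marks a conditional threat. Whether the production rule treats the examples the same way depends on operator semantics and concept definitions that the slide does not spell out.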

Beyond Sentiment – Wisdom of Crowds: Crowd Sourcing Technical Support
- Example – Android User Forum
- Develop a taxonomy of products, features, problem areas
- Develop Categorization Rules:
  - “I use the SDK method and it isn't to bad a all. I'll get some pics up later, I am still trying to get the time to update from fresh 1.0 to 1.1.”
- Find product & feature – forum structure
- Find problem areas in response, nearby text for solution
- Automatic – simply expose lists of “solutions”
- Search Based application
- Human mediated – experts scan and clean up solutions

Beyond Sentiment: Expertise Analysis
- Apply Sentiment Analysis techniques to Expertise
- Expertise Characterization for individuals, communities, documents, and sets of documents
- Experts prefer lower, subordinate levels
- Novices prefer higher, superordinate levels
- General Populace prefers basic level
- Experts’ language structure is different
- Focus on procedures over content
- Develop expertise rules – sentiment and categorization

Expertise Analysis: Expertise – application areas
- Taxonomy / Ontology development / design – audience focus
- Card sorting – non-experts use superficial similarities
- Business & Customer intelligence – add expertise to sentiment
- Deeper research into communities, customers
- Text Mining – Expertise characterization of writer, corpus
- eCommerce – Organization/Presentation of information – expert, novice
- Expertise location – Generate automatic expertise characterization based on documents
- Experiments – Pronoun Analysis – personality types
- Essay Evaluation Software – Apply to expertise characterization
  - Model levels of chunking, procedure words over content

Text Analytics Evaluation: Conclusions
- Start with Self Knowledge – text analytics not an end in itself
- Initial Evaluation – filters, not scorecards
- Weights change output – need self knowledge for good weights
- Proof of Concept – essential
- OOB doesn’t tell you how it will work in real world
- Content and Scenarios is your real world
- Good idea even if you know SAS is the answer
- Importance of operators, relevance for a platform
- Sentiment needs full platform capabilities
- Everyone has room for improvement
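
The point that "weights change output" can be shown with a small weighted-scorecard sketch in Python. The vendor names, feature scores, and weight sets below are invented purely for illustration and are not figures from the Amdocs evaluation; `rank_vendors` is a hypothetical helper.

```python
def rank_vendors(scores, weights):
    """Rank vendors by weighted feature score, best first.

    scores: dict vendor -> dict feature -> score.
    weights: dict feature -> weight.
    """
    totals = {
        vendor: sum(weights[feature] * value for feature, value in features.items())
        for vendor, features in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Invented scores on a 0-10 scale for two placeholder vendors.
scores = {
    "Vendor A": {"categorization": 9, "sentiment": 4, "usability": 6},
    "Vendor B": {"categorization": 6, "sentiment": 8, "usability": 7},
}

# Two plausible weightings of the same features.
categorization_first = {"categorization": 0.6, "sentiment": 0.2, "usability": 0.2}
sentiment_first = {"categorization": 0.2, "sentiment": 0.6, "usability": 0.2}

ranking_1 = rank_vendors(scores, categorization_first)
ranking_2 = rank_vendors(scores, sentiment_first)
```

The same feature scores put Vendor A first under one weighting and Vendor B first under the other, which is why the deck treats scorecards as filters and asks for self knowledge before the weights are set.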

Text Analytics: Future Directions
- Start with the 80% of significant content that is not data
- Enterprise search, content management, Search based applications
- Text Analytics and Text Mining
- Text Analytics turns text into data – Build better TM Apps
- Better extraction and add Subject / Concepts
- Sentiment and Beyond – Behavior, Expertise
- Text Mining and Text Analytics
- TM enriching TA
- Taxonomy development
- New Content Structures, ensemble models
- Text Analytics and Predictive Analytics
- More content, New content – social, interactive – CSR
- New sources of content/data = new & better apps
- Add Learning & Cognitive Science and the future is ?

Questions?
Tom Reamy, KAPS Group
Knowledge Architecture Professional Services