Published by Silas Watson. Modified over 9 years ago.
1
Applying Semantics to Search
Text Analytics
Tom Reamy, Chief Knowledge Architect, KAPS Group
http://www.kapsgroup.com
Enterprise Search Summit New York
2
Agenda
Introduction – Search, Semantics, Text Analytics
  – How do you mean?
Getting (Re)Started with Text Analytics – 3 ½ steps
Preliminary: Strategic Vision
  – What is text analytics and what can it do?
Step 1: Self Knowledge – TA Audit
Step 2: Text Analytics Software Evaluation
Step 3: POC / Quick Start – Pilot to Development
Rest of your Life: Refinement, Feedback, Learning
Conclusions
3
KAPS Group: General
Knowledge Architecture Professional Services – Network of Consultants
Partners
  – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching
  – Attensity, Clarabridge, Lexalytics
Strategy – IM & KM - Text Analytics, Social Media, Integration
Services:
  – Taxonomy/Text Analytics development, consulting, customization
  – Text Analytics Fast Start – Audit, Evaluation, Pilot
  – Social Media: Text based applications – design & development
Clients:
  – Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, etc.
Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
Presentations, Articles, White Papers – http://www.kapsgroup.com
4
Introduction: Search, Semantics, Text Analytics
What do you mean?
All Search is (should be) semantic
  – Humans search concepts, not chicken scratches
Is this semantics?
  – NLP, Concept Search, Semantic Web (ontologies)
Meaning in Text
  – Text Analytics – categorization
  – Extraction – noun phrase, facts-triples
Meaning from Search Results
  – A conversation, not a list of ranked (poorly) documents
5
What is Text Analytics? Text Analytics Features
Noun Phrase Extraction
  – Catalogs with variants, rule based dynamic
  – Multiple types, custom classes – entities, concepts, events
  – Feeds facets
Summarization
  – Customizable rules, map to different content
Fact Extraction
  – Relationships of entities – people-organizations-activities
  – Ontologies – triples, RDF, etc.
Sentiment Analysis
  – Rules & statistical
  – Objects, products, companies, and phrases
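A minimal sketch of the "catalogs with variants ... feeds facets" idea, in Python. The catalog entries, the `CATALOG` and `extract_entities` names, and the matching strategy are all illustrative assumptions, not any vendor's implementation: each canonical entity lists its known variants, and any match is reported under the entity's type so it can populate a facet.

```python
import re
from collections import defaultdict

# Hypothetical catalog: canonical name -> (entity type, known variants).
CATALOG = {
    "International Business Machines": ("organization", ["IBM", "I.B.M."]),
    "Food and Drug Administration": ("organization", ["FDA"]),
    "New York": ("place", ["NYC", "New York City"]),
}

def extract_entities(text):
    """Return {entity_type: set of canonical names} for entities found in text."""
    facets = defaultdict(set)
    for canonical, (etype, variants) in CATALOG.items():
        for name in [canonical, *variants]:
            # Case-sensitive match that must not sit inside a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, text):
                facets[etype].add(canonical)
                break  # one matching variant is enough for this entity
    return dict(facets)

if __name__ == "__main__":
    print(extract_entities("IBM met with the FDA in NYC."))
```

Mapping every variant back to one canonical name is what keeps the resulting facet values consistent across documents.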
6
What is Text Analytics? Text Analytics Features
Auto-categorization
  – Training sets – Bayesian, Vector space
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (Title, body, url)
  – Semantic Network – Predefined relationships, sets of rules
  – Boolean – Full search syntax – AND, OR, NOT
  – Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
This is the most difficult to develop
Build on a Taxonomy
Combine with Extraction
  – If any of list of entities and other words
  – Disambiguation - Ford
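The "Disambiguation - Ford" bullet can be sketched as a proximity rule. This Python emulation is an assumption for illustration only: the exact semantics of operators such as DIST vary by product, and the word lists, category labels, and function names below are hypothetical. The rule assigns a category only when "Ford" appears within a few tokens of context words from one list and not the other.

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within_distance(toks, term, others, max_dist):
    """True if `term` occurs within `max_dist` tokens of any word in `others`.

    Order is ignored, loosely imitating an unordered distance operator.
    """
    term_pos = [i for i, t in enumerate(toks) if t == term]
    other_pos = [i for i, t in enumerate(toks) if t in others]
    return any(abs(i - j) <= max_dist for i in term_pos for j in other_pos)

# Hypothetical context vocabularies for the two senses of "Ford".
CAR_WORDS = {"truck", "car", "dealer", "mustang", "motor", "vehicle"}
PERSON_WORDS = {"president", "gerald", "harrison", "actor"}

def categorize_ford(text, max_dist=5):
    """Return a category label, "Ambiguous", or None if "Ford" is absent."""
    toks = tokens(text)
    if "ford" not in toks:
        return None
    is_company = within_distance(toks, "ford", CAR_WORDS, max_dist)
    is_person = within_distance(toks, "ford", PERSON_WORDS, max_dist)
    if is_company and not is_person:
        return "Automotive/Ford Motor Company"
    if is_person and not is_company:
        return "People/Ford"
    return "Ambiguous"

if __name__ == "__main__":
    print(categorize_ford("The Ford dealer sold a new truck"))
    print(categorize_ford("President Gerald Ford signed the bill"))
```

Returning "Ambiguous" rather than guessing leaves room for a human or a further rule to decide, which fits the iterative rule development the deck describes.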
7
Case Study – Categorization & Sentiment
14
Preliminary: Text Analytics Vision
What can Text Analytics Do?
Strategic Questions – why, what value from the text analytics, how are you going to use it
  – Platform or Applications?
What are the basic capabilities of Text Analytics?
What can Text Analytics do for Search?
  – After 10 years of failure – get search to work?
What can you do with smart search based applications?
  – RM, PII, Social
ROI for effective search – difficulty of believing
  – Problems with metadata, taxonomy
15
Preliminary: Text Analytics Vision
Adding Structure to Unstructured Content
How do you bridge the gap – taxonomy to documents?
Tagging documents with taxonomy nodes is tough
  – And expensive – central or distributed
Library staff – experts in categorization not subject matter
  – Too limited, narrow bottleneck
  – Often don’t understand business processes and business uses
Authors – Experts in the subject matter, terrible at categorization
  – Intra and Inter inconsistency, “intertwingleness”
  – Choosing tags from taxonomy – complex task
  – Folksonomy – almost as complex, wildly inconsistent
  – Resistance – not their job, cognitively difficult = non-compliance
Text Analytics is the answer(s)!
16
Preliminary: Text Analytics Vision
Adding Structure to Unstructured Content
Text Analytics and Taxonomy Together – Platform
  – Text Analytics provides the power to apply the taxonomy
  – And metadata of all kinds
  – Consistent in every dimension, powerful and economic
Hybrid Model
  – Publish Document -> Text Analytics analysis -> suggestions for categorization, entities, metadata -> present to author
  – Cognitive task is simple -> react to a suggestion instead of select from head or a complex taxonomy
  – Feedback – if author overrides -> suggestion for new category
  – Facets – Requires a lot of Metadata - Entity Extraction feeds facets
Hybrid – Automatic is really a spectrum – depends on context
  – Automatic – adding structure at search results
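The hybrid model's flow (publish, analyze, suggest, author reacts, overrides become feedback) might look like the following Python sketch. Everything here is an assumption for illustration: the `rules` mapping stands in for a real categorization engine, and `suggest_categories`, `author_review`, and `Review` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Review:
    accepted: list                 # suggestions the author kept
    new_category_candidates: list  # author choices the system did not suggest

def suggest_categories(text, rules):
    """Suggest each category whose trigger terms occur as substrings of the text."""
    lowered = text.lower()
    return sorted(
        cat for cat, terms in rules.items()
        if any(term in lowered for term in terms)
    )

def author_review(suggestions, author_choice):
    """Split the author's final tags into accepted suggestions and overrides.

    Overrides are kept as feedback: candidates for new categories or new rules.
    """
    accepted = [c for c in author_choice if c in suggestions]
    overrides = [c for c in author_choice if c not in suggestions]
    return Review(accepted=accepted, new_category_candidates=overrides)

if __name__ == "__main__":
    rules = {"Finance": ["budget", "invoice"], "HR": ["hiring", "benefits"]}
    suggestions = suggest_categories("Q3 budget and hiring plan", rules)
    print(suggestions)
    print(author_review(suggestions, ["Finance", "Strategy"]))
```

The author only confirms or corrects a short list, which is the "react to a suggestion" step; the override list is what a taxonomy team would review later.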
17
Step 1: TA Information Audit
Start with Self Knowledge
Info Problems – what, how severe
Formal Process - KA audit – content, users, technology, business and information behaviors, applications
  - Or informal for smaller organization
Contextual interviews, content analysis, surveys, focus groups, ethnographic studies, Text Mining
Category modeling – Cognitive Science – how people think
  Natural level categories mapped to communities, activities
  Novices prefer higher levels
  Balance of informativeness and distinctiveness
Text Analytics Strategy/Model – forms, technology, people
18
Step 1: TA Information Audit
Start with Self Knowledge
Ideas – Content and Content Structure
  – Map of Content – Tribal language silos
  – Structure – articulate and integrate
  – Taxonomic resources
People – Producers & Consumers
  – Communities, Users, Central Team
Activities – Business processes and procedures
  – Semantics, information needs and behaviors
  – Information Governance Policy
Technology
  – CMS, Search, portals, text analytics
  – Applications – BI, CI, Semantic Web, Text Mining
19
Step 2: TA Evaluation
Varieties of Taxonomy / Text Analytics Software
Taxonomy Management - extraction
Full Platform
  – SAS, SAP, Smart Logic, Concept Searching, Expert System, IBM, Linguamatics, GATE
Embedded – Search or Content Management
  – FAST, Autonomy, Endeca, Vivisimo, NLP, etc.
  – Interwoven, Documentum, etc.
Specialty / Ontology (other semantic)
  – Sentiment Analysis – Attensity, Lexalytics, Clarabridge, Lots
  – Ontology – extraction, plus ontology
20
Step 2: Text Analytics Evaluation
Different Kind of software evaluation
Traditional Software Evaluation - Start
  – Filter One - Ask Experts - reputation, research – Gartner, etc.
      Market strength of vendor, platforms, etc.
      Feature scorecard – minimum, must have, filter to top 6
  – Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus
  – Filter Three – In-Depth Demo – 3-6 vendors
Reduce to 1-3 vendors
Vendors have different strengths in multiple environments
  – Millions of short, badly typed documents, Build application
  – Library 200 page PDF, enterprise & public search
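The "feature scorecard – minimum, must have, filter to top 6" step can be sketched in Python as a two-stage filter. The vendor names, feature names, scores, and weights below are invented for illustration, and `shortlist` is a hypothetical helper: vendors lacking any must-have feature are dropped first, and the rest are ranked by a weighted sum.

```python
def shortlist(vendors, must_have, weights, top_n=6):
    """Drop vendors missing any must-have feature, then rank by weighted score.

    vendors:   {vendor name: {feature: score}}, where a score of 0 means "absent"
    must_have: features every shortlisted vendor needs a non-zero score for
    weights:   {feature: weight}; features without a weight count as 0
    """
    qualified = {
        name: feats for name, feats in vendors.items()
        if all(feats.get(f, 0) > 0 for f in must_have)
    }
    scored = {
        name: sum(weights.get(f, 0) * score for f, score in feats.items())
        for name, feats in qualified.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda n: (-scored[n], n))[:top_n]

if __name__ == "__main__":
    vendors = {
        "Vendor A": {"categorization": 3, "extraction": 2},
        "Vendor B": {"categorization": 5, "extraction": 0},
        "Vendor C": {"categorization": 4, "extraction": 4},
    }
    print(shortlist(vendors, ["categorization", "extraction"],
                    {"categorization": 2, "extraction": 1}))
```

The scorecard only narrows the field; the deck's later filters (technology fit, in-depth demo, POC) still decide the outcome.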
21
Design of the Text Analytics Selection Team
Traditional Candidates – IT, Business, Library
IT - Experience with software purchases, needs assessment, budget
  – Search/Categorization is unlike other software, deeper look
Business - understand business, focus on business value
  They can get executive sponsorship, support, and budget
  – But don’t understand information behavior, semantic focus
Library, KM - Understand information structure
  Experts in search experience and categorization
  – But don’t understand business or technology
22
Design of the Text Analytics Selection Team
Interdisciplinary Team, headed by Information Professionals
Relative Contributions
  – IT – Set necessary conditions, support tests
  – Business – provide input into requirements, support project
  – Library – provide input into requirements, add understanding of search semantics and functionality
Much more likely to make a good decision
Create the foundation for implementation
23
Step 3: Proof of Concept / Pilot Project
4 weeks POC – bake off / or short pilot
Real life scenarios, categorization with your content
2 rounds of development, test, refine / Not OOB
Need SMEs as test evaluators – also to do an initial categorization of content
Measurable Quality of results is the essential factor
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time
Taxonomy Developers – expert consultants plus internal taxonomists
24
Step 3: Proof of Concept
POC Design: Evaluation Criteria & Issues
Basic Test Design – categorize test set
  – Score – by file name, human testers
Categorization & Sentiment – Accuracy 80-90%
  – Effort Level per accuracy level
Quantify development time – main elements
Comparison of two vendors – how score?
  – Combination of scores and report
Quality of content & initial human categorization
  – Normalize among different test evaluators
Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors
Quality of taxonomy – structure, overlapping categories
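One way to make "score by file name" concrete is to compare each vendor's category assignments with the human testers' categorization, keyed by file name. The Python sketch below is an assumption about how such scoring could be set up, not the method used in the case study; the file names, labels, and the `score` function are illustrative.

```python
def score(predicted, gold):
    """Micro-averaged precision and recall over a test set.

    predicted, gold: {file name: set of category labels}
    gold holds the human testers' categorization; files missing from
    `predicted` count as having no predicted categories.
    """
    tp = fp = fn = 0
    for doc, gold_cats in gold.items():
        pred_cats = predicted.get(doc, set())
        tp += len(pred_cats & gold_cats)   # correct assignments
        fp += len(pred_cats - gold_cats)   # assigned but wrong
        fn += len(gold_cats - pred_cats)   # expected but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

if __name__ == "__main__":
    gold = {"a.txt": {"Finance"}, "b.txt": {"HR"}, "c.txt": {"Finance", "Legal"}}
    vendor_1 = {"a.txt": {"Finance"}, "b.txt": {"Finance"}, "c.txt": {"Legal"}}
    print(score(vendor_1, gold))
```

Running the same function on each vendor's output against the same human-tagged set gives comparable numbers, though the slide's caveats still apply: the human categorization itself varies between evaluators and needs normalizing.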
25
Step 3: Proof of Concept
POC and Early Development: Risks and Issues
CTO Problem – This is not a regular software process
Semantics is messy not just complex
  – 30% accuracy isn’t 30% done – could be 90%
Variability of human categorization
Categorization is iterative, not “the program works”
  – Need realistic budget and flexible project plan
Anyone can do categorization
  – Librarians often overdo, SMEs often get lost (keywords)
Meta-language issues – understanding the results
  – Need to educate IT and business in their language
26
Step 3: Proof of Concept / Quick Start Outcomes
POC – understand how text analytics can work in your environment
Learn the software – internal resources trained by doing
Learn the language – syntax (Advanced Boolean)
Learn categorization and extraction
Good categorization rules
  – Balance of general and specific
  – Balance of recall and precision
Develop or refine taxonomies for categorization
POC – can be the Quick Start or the Start of the Quick Start
27
Development, Implementation
Quick Start – First Application: Search and TA
Simple Subject Taxonomy structure
  – Easy to develop and maintain
Combined with categorization capabilities
  – Added power and intelligence
Combined with people tagging, refining tags
Combined with Faceted Metadata
  – Dynamic selection of simple categories
  – Allow multiple user perspectives
      Can’t predict all the ways people think
      Monkey, Banana, Panda
Combined with ontologies and semantic data
  – Multiple applications – Text mining to Search
  – Combine search and browse
28
3. Roles and Responsibilities
Sample roles matrix:
29
3. Roles and Responsibilities
Common Roles and SharePoint Permissions:
30
Rest of Your Life: Maintenance, Refinement, Application, Learning
This is easy – if you did the TA Audit and POC/Quick Start
Content – new content – calls for flexible, new methods
People – Have a trained team and extended team
Technology – integrate into variety of applications – SBA
Processes, workflow – how semi-automate, part of normal
Maintenance – Refinement – in world of rapid change
  – Mechanisms for feedback, learning – of text analysts and software
Future Directions - Advanced Applications
  – Embedded Applications, Semantic Web + Unstructured Content
  – Integration of Enterprise and External - Social Media
  – Expertise Analysis, Behavior Prediction (Predictive Analytics)
  – Voice of the Customer, Big Data
  – Turning unstructured content into data – new worlds
31
Conclusion
Text Analytics can fulfill the promise of taxonomy and metadata
  – Economic and consistent structure for unstructured content
Search and Text Analytics
  – Search that works – finally!
  – Platform for Search-Based Applications
Text Analytics is a different kind of software / solution
  – Infrastructure – Hybrid CM to Search and feedback
How to Get Started with Text Analytics
  – Strategic Vision of Text Analytics
  – Three steps – TA Audit, TA evaluation, POC/Quick Start
Text Analytics opens up new worlds of applications
32
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
www.TextAnalyticsWorld.com
Oct 3-4, Boston