Intelligent Interactions with Search Results
Getting Beyond Those Blue Results Lists (or Smart Text)
Tom Reamy
Chief Knowledge Architect, KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Taxonomy Boot Camp, KMWorld, Enterprise Search Summit: Nov. Washington DC
Agenda
Introduction: Search and Structure: Smart Text
  Smart Text – foundation of text analytics
Adding Structure to Unstructured Text
  Dynamic Sections and more, Better Relevancy Calculations
  Complex Document Summaries, Deeper Personalization
Case Study
  Publishing: Processing 700K Proposals
Beyond Search – Building on the Foundation
Conclusions
KAPS Group: General
Knowledge Architecture Professional Services – Network of Consultants
Strategy – IM & KM – Text Analytics, Social Media, Integration
Services:
  Taxonomy/Text Analytics development, consulting, customization
  Text Analytics Fast Start – Audit, Evaluation, Pilot
  Social Media: Text based applications – design & development
Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, DOT, World Bank, etc.
Partners – Expert System, SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics
Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
Presentations, Articles, White Papers – http://www.kapsgroup.com
Introduction: Elements of Smart Text – Text Analytics
Text Mining – NLP, statistical, predictive, machine learning
Extraction – entities – known and unknown, concepts, events
  Disambiguation – Ford
Fact Extraction – ontology, relationships of entities
Sentiment Analysis – Positive, Negative – products, companies
Auto-categorization
  Training sets, Terms
  Rules – simple – position in text (Title, body, url)
  Boolean – Full search syntax – AND, OR, NOT
  Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
  Based on taxonomy/ontology
Enterprise Text Analytics
Search is still #1 = 30-50% of applications
New Standard Search – facets (more and more metadata), auto-categorization built on taxonomies, clustering
Trend = Text Analytics/Search as Semantic Infrastructure
Platform for Info Apps (Search-based applications)
SharePoint – Major focus of TA companies – fix problems with taxonomy/folksonomy
Hybrid workflow – Publish document -> TA analysis -> suggestions for categorization, entities, metadata -> present to author
External information = more automation, extraction – precision more important
Enterprise Text Analytics
Adding Structure to Unstructured Content
Beyond Documents – categorization by corpus, by page, sections or even sentence or phrase
Documents are not unstructured – variety of structures
Text indicators to define sections of the document
  Objectives, Abstract, Purpose, Aim – all the “same” section
  Sections – Specific – “Abstract” to Function “Evidence”
  Start of section is easy – where does it end?
Experiment – clusters / vocabulary to define section
  Textual complexity, level of generality
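The idea of treating several heading variants as the "same" section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alias table, the canonical labels, and the function name `split_sections` are hypothetical, and a section is simply assumed to end where the next recognized heading begins, which sidesteps the harder "where does it end?" question raised on the slide.

```python
import re

# Hypothetical alias table: heading variants mapped to one canonical label.
SECTION_ALIASES = {
    "objectives": "purpose",
    "abstract": "purpose",
    "purpose": "purpose",
    "aim": "purpose",
    "methods": "methods",
    "results": "evidence",
}

# A heading is assumed to be a line holding only a known indicator word,
# optionally followed by a colon.
HEADING = re.compile(
    r"^\s*(" + "|".join(SECTION_ALIASES) + r")\s*:?\s*$", re.IGNORECASE
)


def split_sections(text):
    """Return (canonical_label, body) pairs in document order."""
    sections = []
    label, lines = None, []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Assumption: a section ends where the next heading starts.
            if label is not None:
                sections.append((label, "\n".join(lines).strip()))
            label, lines = SECTION_ALIASES[match.group(1).lower()], []
        elif label is not None:
            lines.append(line)
    if label is not None:
        sections.append((label, "\n".join(lines).strip()))
    return sections
```

Text before the first recognized heading is ignored here; a fuller approach would need the vocabulary or cluster signals the slide mentions to find section ends that have no explicit heading.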
Enterprise Text Analytics
Categorization and Beyond
Need to develop flexible categorization and taxonomy – tweets to 200 page PDF
Rules or sample documents?
  Need more precision and granularity than documents can do
  Training sets – not as easy as thought
Applications require sophisticated rules, not just categorization by similarity
Separate logic of the rules from the text
  Stable rules, changing text
Scores – relevancy with thresholds
  Not just frequency of words
Enterprise Text Analytics
Document Type Rules
(START_2000, (AND, (OR, _/article:"[Abstract]", _/article:"[Methods]"), (OR, _/article:"clinical trial*", _/article:"humans", (NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"),
If the article has sections like Abstract or Methods AND has phrases around “clinical trials / humans” and NOT words like “animals” within 5 words of the “clinical trial” words – count it and add up a relevancy score
Primary issue – major mentions, not every mention
Combination of noun phrase extraction and categorization
Results – virtually 100%
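The logic described for this rule can be approximated in plain Python. This is a hedged sketch and not the rule engine's actual behavior. It assumes that START_2000 limits matching to the first 2000 characters, that the section test is a simple word lookup, and that the exclusion window applies to both "clinical trial" and "humans" mentions. The function name `score_clinical_trial` and both word lists are hypothetical.

```python
import re

# Words taken from the NOT clause of the slide's rule.
EXCLUDE = {"approved", "safe", "use", "animals"}
# Assumption: a section is "present" if its name appears as a word.
SECTION_WORDS = {"abstract", "methods"}


def score_clinical_trial(text, window=5, scope=2000):
    """Count qualifying mentions; the caller compares the count to a threshold."""
    # Assumption: START_2000 means "look only at the first 2000 characters".
    tokens = re.findall(r"[a-z]+", text[:scope].lower())
    if not SECTION_WORDS & set(tokens):
        return 0
    score = 0
    for i, tok in enumerate(tokens):
        is_trial = (
            tok == "clinical"
            and i + 1 < len(tokens)
            and tokens[i + 1].startswith("trial")
        )
        if not (is_trial or tok == "humans"):
            continue
        # Skip mentions with an excluded word within `window` tokens.
        nearby = tokens[max(0, i - window): i + window + 1]
        if EXCLUDE.isdisjoint(nearby):
            score += 1
    return score
```

Because the score is a count of qualifying mentions and not a yes/no match, a threshold on it can separate documents with major mentions from those with a single passing reference.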
Case Study
Publishing Project: Reed Construction Data
700,000 Proposals – Wide Variation
Process Proposals – extract data – 30-50 types
Current Manual Process – Internal Teams
  Expensive and Slow
Structure Variety of Unstructured Documents
  Generate Table of Contents
  Generate Sections and Capture Text
  Semi-automatically extract Key Information
Save Time & Money, Flexible Hiring, New Offerings
Publishing Project: Example Rules
Automated Table of Contents
Publishing Project: Example Rules
Key Data Extraction
Bid Dates/Times
Roles (Architect, Designer, etc.) – names and addresses, etc.
Project Attributes – Cost, Invitation Number, Parking, etc.
Some Easy, Some Hard – Address!
Example:
  ARCHITECT:
  MICHEAL KIM ARCHITECTURE
  1 HOLDEN STREET
  BROOKLINE, MA 02445
  P: (617) 739-6925
  F: (772) 325-2991
Technique – create broad and stable templates, variation in the text
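A "broad and stable template" for a role block like the example above could look like the following Python sketch. It is an illustration, not the project's actual rules. The role list, the group names, and the function `extract_role_block` are hypothetical, and the pattern assumes US-style phone numbers and a two-letter state followed by a five-digit ZIP code.

```python
import re

# Hypothetical role labels; the slide names only Architect and Designer.
ROLES = ("ARCHITECT", "DESIGNER", "ENGINEER", "OWNER")
# Assumption: phone and fax numbers look like "(617) 739-6925".
PHONE = r"\(\d{3}\)\s*\d{3}-\d{4}"

ROLE_BLOCK = re.compile(
    r"(?P<role>" + "|".join(ROLES) + r"):\s*"
    # Everything up to the state/ZIP anchor: firm name, street, and city.
    r"(?P<name_and_address>.+?),?\s+"
    r"(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})"
    r"(?:\s+P:\s*(?P<phone>" + PHONE + r"))?"
    r"(?:\s+F:\s*(?P<fax>" + PHONE + r"))?",
    re.DOTALL,
)


def extract_role_block(text):
    """Return the first role block as a dict of fields, or None."""
    match = ROLE_BLOCK.search(text)
    return match.groupdict() if match else None
```

The template anchors on the stable parts (role label, state and ZIP, the P: and F: markers) and leaves the variable middle as one span. That matches the slide's point that addresses are the hard part: splitting firm name, street, and city would need further rules.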
Publishing Project: Example Rules Key Project Data
Publishing Project: Process & Approach
Smart Search: Metadata, Metadata, Metadata
Basic Facets: Date, People, Organization, Content-Type
Advanced Facets: Materials, Methods, Project Attributes, etc.
  Context dependent
Deep personalization
  Selection of facets by role, community, task, content
Smart Summarization
  Better conceptual description
  Complex summaries – key data, document sections, etc.
Smart Search – beyond simple relevancy
Next – Beyond Search – active agents – don’t need questions
Building on the Foundation: Applications
Pronoun Analysis: Fraud Detection; Enron Emails
Patterns of “Function” words reveal a wide range of insights
Function words = pronouns, articles, prepositions, conjunctions, etc.
  Used at a high rate, short and hard to detect, very social, processed in the brain differently than content words
Areas: sex, age, power-status, personality – individuals and groups
Lying / Fraud detection: Documents with lies have:
  Fewer, shorter words, fewer conjunctions, more positive emotion words
  More use of “if, any, those, he, she, they, you”, less “I”
Current research – 76% accuracy in some contexts
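The surface features listed on this slide are easy to measure, as the Python sketch below illustrates. It computes only rates and is not a lie detector: the function name `function_word_profile` and the returned keys are hypothetical, and turning these numbers into a fraud signal would need a trained model and labeled data that the slide does not describe.

```python
import re
from collections import Counter

# Words the slide reports as more frequent in deceptive documents.
MARKERS = {"if", "any", "those", "he", "she", "they", "you"}


def function_word_profile(text):
    """Return simple per-document rates for the features named on the slide."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {"tokens": 0, "marker_rate": 0.0,
                "first_person_rate": 0.0, "mean_word_length": 0.0}
    counts = Counter(tokens)
    marker_count = sum(counts[word] for word in MARKERS)
    return {
        "tokens": total,
        "marker_rate": marker_count / total,       # "if, any, those, ..."
        "first_person_rate": counts["i"] / total,  # use of "I"
        "mean_word_length": sum(len(t) for t in tokens) / total,
    }
```

Comparing these rates across authors or over time is one plausible use, since the slide stresses patterns, not single occurrences.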
Building on the Foundation: Social Media
Beyond Simple Sentiment
Beyond Good and Evil (positive and negative)
  Degrees of intensity, complexity of emotions and documents
Importance of Context – around positive and negative words
  Rhetorical reversals – “I was expecting to love it”
  Issues of sarcasm (“Really Great Product”), slanguage
Essential – need full categorization and concept extraction
New Taxonomies – Appraisal Groups – “not very good”
  Supports more subtle distinctions than positive or negative
Emotion taxonomies – Joy, Sadness, Fear, Anger, Surprise, Disgust
  New Complex – pride, shame, confusion, skepticism
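An appraisal group such as "not very good" can be captured as a structured unit, with the negation and intensifier kept alongside the polarity word. The Python sketch below is a minimal illustration. The three word lists and the function name `appraisal_groups` are hypothetical, and it deliberately assigns no numeric score, since how negation and intensity should combine is a modeling choice the slide leaves open.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POLARITY = {"good": "positive", "great": "positive", "love": "positive",
            "bad": "negative", "terrible": "negative"}
INTENSIFIERS = {"very", "really", "extremely", "somewhat"}
NEGATORS = {"not", "never", "hardly"}


def appraisal_groups(text):
    """Find polarity words plus the modifiers immediately before them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    groups = []
    for i, tok in enumerate(tokens):
        if tok not in POLARITY:
            continue
        start = i
        intensifiers = []
        # Walk left over intensifiers ("very", "really", ...).
        while start > 0 and tokens[start - 1] in INTENSIFIERS:
            start -= 1
            intensifiers.insert(0, tokens[start])
        # Then check for one negator directly before the group.
        negated = start > 0 and tokens[start - 1] in NEGATORS
        if negated:
            start -= 1
        groups.append({
            "phrase": " ".join(tokens[start:i + 1]),
            "head": tok,
            "polarity": POLARITY[tok],
            "intensifiers": intensifiers,
            "negated": negated,
        })
    return groups
```

A window this small cannot see rhetorical reversals or sarcasm, which is why the slide calls for full categorization and concept extraction around the sentiment words.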
Building on the Foundation: Applications
Behavior Prediction – Telecom Customer Service
Problem – distinguish customers likely to cancel from mere threats
Basic Rule / Intention:
(START_20, (AND, (DIST_7, "[cancel]", "[cancel-what-cust]"), (NOT, (DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
Examples:
  customer called to say he will cancell his account if the does not stop receiving a call from the ad agency.
  cci and is upset that he has the asl charge and wants it off or her is going to cancel his act
More sophisticated analysis of text and context in text
Combine text analytics with Predictive Analytics and traditional behavior monitoring for new applications
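The intent of the cancel rule can be mimicked in Python to show why the two example notes come out differently. This is a sketch under assumptions, not the rule engine's semantics: START_20 is read as "the cancel word appears among the first 20 words", the bracketed concepts are replaced by small hypothetical word lists, and "[one-line]" is left out because its contents are not shown.

```python
import re

# Hypothetical stand-ins for the rule's bracketed concepts.
CANCEL = {"cancel", "cancell", "cancels", "canceling", "cancelling"}
CANCEL_WHAT = {"account", "acct", "act", "service"}
BLOCKERS = {"if", "restore"}  # "[one-line]" omitted: contents unknown


def _near(tokens, positions, vocab, dist):
    """True if any vocab word lies within `dist` tokens of any position."""
    for i in positions:
        window = tokens[max(0, i - dist): i] + tokens[i + 1: i + dist + 1]
        if any(tok in vocab for tok in window):
            return True
    return False


def likely_to_cancel(note, scope=20):
    tokens = re.findall(r"[a-z]+", note.lower())
    # Assumption: START_20 restricts the cancel word to the first 20 tokens.
    cancel_positions = [i for i, t in enumerate(tokens[:scope]) if t in CANCEL]
    has_intent = _near(tokens, cancel_positions, CANCEL_WHAT, 7)   # DIST_7
    is_blocked = _near(tokens, cancel_positions, BLOCKERS, 10)     # NOT DIST_10
    return has_intent and not is_blocked
```

Under these assumptions the first example note is treated as a conditional threat, because "if" falls within ten words of the cancel word, while the second note is flagged as a likely cancellation. The misspelling "cancell" is listed explicitly, which reflects how noisy call-center notes are.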
Building on the Foundation: Current Applications
Survey Analysis – Add analysis of free text
Automated Essay Scoring – Second Generation
  Beyond words (polysyllabic) to meaning
Story Telling – Data Heavy, Sports, Finance
  90% of news machine written by 2025, books?
Legal Review / eDiscovery
  TA – categorize and filter to smaller, more relevant set
  Payoff is big – One firm with 1.6 M docs – saved $2M
Voice of the Customer / Employee / Voter
  Analysis of Blogs, Tweets, Social Networks
  Early identification of problems with products and services
Customer Relationship & Brand Management, Fraud Detection
Smart Text: New Directions – Integration
Deep Integration – Text Analytics
New Forms of Rules – Combine Text Mining and Text Analytics
Incorporate clusters – CLUSTER Operator
  Like SENTENCE but more flexible, dynamic
More Dynamic Sections
  Build up from “Categorization” of sentences – based on co-reference
  Smaller units – Appraisal Taxonomies for Subjects, Build Larger Units
  Complex Units – Collections of Paragraphs based on meaning
Sentence Level Sentiment Techniques for Subjects
Smarter Relevancy – not frequency – develop new scoring
Hybrid Machine-Human
  Where, When, and How
  Development, Tagging, Usage, Analytics, etc.
  How to Get the Best of Both?
Conclusions
Text Analytics can feed/extend Big Data and Cognitive Science applications
Discover structure in (un)structured text
Apply text analytics to sections of document – new kinds of relevancy
Creating multiple views into data inside text – smart search results – interactive (facets plus)
Modular design – better search, new applications, Watson
Future: Cognitive Computing: learns, discovers patterns based on context, highly integrated, meaning-based, highly interactive
  Text Analytics adds depth of meaning
Future – Women, Fire, and Dangerous Things
  Text Analytics and Cognitive Science = Metaphor Analysis, deep language understanding, common sense?
Coming Soon! New Book coming: Text Analytics: Everything You Need to Know to Conquer Information Overload, Mine Social Media for Real Value, and Turn Big Text into Big Data November
Questions? Tom Reamy tomr@kapsgroup.com KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com