Download presentation
Presentation is loading. Please wait.
Published byJustin Shepherd Modified over 6 years ago
1
Combining Taxonomy, Ontology, Text, and Data A Deep Text Approach
Tom Reamy Chief Knowledge Architect KAPS Group Author: Deep Text
2
Agenda Introduction & What is Text Analytics?
Deep Text vs, Deep Learning Taxonomy + Text Analytics = Catonomy Approaches to Search Faceted Navigation on steroids Case Study – DOT Environment, Process, Architecture Semantic Model Rule Development Results Conclusion
3
Introduction: KAPS Group
Network of Consultants and Partners – “Hiring” Services: Strategy – Text Analytics, Social Media, Integration Text Analytics Smart Start, Next Level – Audit, Evaluation, Pilot Development - Taxonomy/Text Analytics, Social Media Partners –Synaptica, SAS, IBM, Smart Logic, Expert Systems, Clarabridge, Lexalytics, BA Insight, BiText Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, Dept. of Transportation, etc. Presentations, Articles, White Papers – Book –Deep Text: Using Text Analytics to Conquer Information Overload, Get Real Value from Social Media, and Add Big(ger) Text to Big Data
4
A treasure trove of technical detail, likely to become a definitive source on text analytics – Kirkus Reviews Information Today Table
5
Deep Text Approach What is Text Analytics?
Text Mining – NLP, statistical, predictive, machine learning Extraction – entities, concepts, events Known and unknown, catalogs and rules/patterns Fact extraction – entities and their relationships Sentiment Analysis Positive and Negative expressions – dictionaries & rules Auto-categorization Training sets, Terms Rules: Boolean - AND, OR, NOT Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE Build rules based, not simply Bag of Individual Words Disambiguation - Identification of objects, events, context
6
SLA – Categorization and Annotation Categorization Basics
Representation of Domain knowledge – taxonomy, ontology Categorization – Know What Most basic to human cognition Most difficult to do with software No single correct categorization Women, Fire, and Dangerous Things Basic Level Categories – Mammal-Dog-Golden Retriever Monkey, Panda, Banana Sentiment Analysis to Expertise Analysis(KnowHow) Know How, skills, “tacit” knowledge
7
SLA – Categorization and Annotation Entity Extraction Process
Entities and Facts – Big(ger) Text feeds Big Data Search – Facets Facet Design – from Knowledge Audit, K Map Find and Convert catalogs: Organization – internal resources People – corporate yellow pages, HR Scripts to convert catalogs – programming resource Linked Data – fast, less control Build initial rules – follow categorization process Differences – scale, threshold – application dependent Recall – Precision – balance set by application Issue – disambiguation – Ford company, person, car
8
SLA Deep Text vs. Deep Learning
Two Schools – Language Rules vs. Math / Patterns Depth & Intelligence vs. Speed & Power Deep Learning Neural Networks – from 1980’s, new = size and speed Strongest in areas like image recognition, fact lookup Weakest – concepts, subjects, deep language, metaphors, etc. Deep Text – Language, concepts, symbols Categorization and Beyond Adds smarts to everything else Rules = higher accuracy – 98% - Rules brittle?
9
SLA Deep Text vs. Deep Learning
Deep Learning is a Dead End - accuracy – 60-70% Black Box – don’t know how to improve except indirect manipulation of input – “We don’t know how or why it works” Domain Specific, tricks not deep understanding No common sense and no strategy to get there Major – loss of quality – who is training who? Extra Benefits of a Deep Text Approach Multiple Info Apps Multiple search techniques
10
Deep Text Appraoch Text Analytics and Taxonomy = Catonomy
A taxonomy with built-in categorization to apply the taxonomy Works for all types of taxonomies Simple one level to complex 10,000’s of nodes Subject taxonomy to action taxonomy to sentiment taxonomy to thing taxonomy Catology Too! Ontology = multiple, faceted taxonomies and how they are related Limits – Cost, Need for customization Productization = lower costs Templates – logic + vocabulary Sectionize documents
11
Beyond Automatic Tagging Metadata – Tagging – Mind the Gap
How do you bridge the gap – taxonomy to documents? Tagging documents with taxonomy nodes is tough And expensive – central or distributed Library staff – Too limited, narrow bottleneck Authors – Experts in the subject matter, terrible at categorization Intra and Inter inconsistency, “intertwingleness” Choosing tags from taxonomy – complex task Resistance – not their job, cognitively difficult = non-compliance Hybrid Model: Publish Document -> Text Analytics analysis -> suggestions for categorization, entities, metadata - > present to author
12
Basic Application: Search Multi-dimensional and Smart
Faceted Navigation has become the basic/ norm Facets require huge amounts of metadata Entity / noun phrase extraction is fundamental Automated with disambiguation (through categorization) Ontology – Data and facet structure and relationship Taxonomy = Subject/Topics Clusters and Tag Clouds – discovery & exploration Auto-categorization – aboutness, subject facets This is still fundamental to search experience InfoApps only as good as fundamentals of search
13
Deep Text Search Adding Structure to Unstructured Content
Documents are not unstructured – variety of structures Sections – Specific - “Abstract” to Function “Evidence” Multiple Text Indicators – Categorization Rule Multiple style indicators – Font, Bold, etc. Corpus – document types/purpose Textual complexity, level of generality, Other characterization Clusters and machine learning – at section level, not document Future = Combine machine learning and rules Add taxonomy to machine learning Add learning to text analytics
14
SLA Enterprise Info Apps
Focus on business value, cost cutting, new revenues Applications require sophisticated rules, not just categorization by similarity Business Intelligence – products, competitors It is a growing field with revenues of $13.1 billion in 2015. Financial Services - Combine structured transaction data (what) with unstructured text (why) Customer Relationship Management, Fraud Detection Stock Market Prediction , eDiscovery, Text Assisted Review, HR resumes, automatic summaries, Expertise analysis, etc., etc.
15
SLA Social Media Applications
Voice of the Customer-Employee-Voter Detection of a recurring problem categorized by subject, customer, client, product, parts, or by representative. Subscriber mood before and after a call – and why Political – conservative and liberal minds/texts Disgust, shame, cooperation, openness Behavior Prediction – customer likely to cancel Fraud detection – lies in text have different patterns Areas: sex, age, power-status, personality
16
SLA: Case Study – Text and Data General Environment
DOT – Adding text analytics to SharePoint Search Fragmented environment – 51 DOTs, 5-15 Districts Project is main organizational unit – wanted cross-project capability Candidate for Worst Information Environment No standardization, no content management CM = s with scanned forms as attachments SharePoint – but – no metadata, no standards, no taxonomy Formal Thesaurus – did not apply to business categories
17
SLA: Case Study – Text and Data Objectives
Demonstrate and validate concepts and methods to improve findability – with text analytics Assess effort required and transferability to other DOTs Document a case study for inclusion in the guidance Create resources that other DOTs can build upon - e.g. categorization rules for common DOT terms, faceted search navigation design Normal – train people in the DOT
18
SLA: Case Study – Text and Data Methods & Activities
Research foundation – stakeholders, users, needs and behaviors – search – variety of search needs from single targets to research topics to FOIA requests, content analysis Content Types :Obtained sample content from 3 districts – from combination of SharePoint and shared drives Daily Work Reports (C-84s) – 2,000+ Work Orders/Change Orders (C-10s) – 1,000+ Source of Materials Forms (C-25s) – 1,000+ Variety of formats: PDF, MHT, Word, Excel, RTF, MSG / OCR
19
Big Picture – Architecture for Enterprise Search Solution
Content Repositories SharePoint Ontology and Text Analytics Rules Publisher Shared Drives Integrated Content (Virtual or Physical) Indexer ECM System Assign Tag “Connectors” Archive Indexed Content Search Engine Info Apps
21
Solution Development Semantic Model – Elements (“facets”)
Content Type Materials Source of Materials Equipment DWR, Pay Items Work Order, Work Order Category Work Order-Related Work Issue Project Profile Drainage Project No/Contract No/UPC Utility Location: District, Jurisdiction, Route Weather Plan-Related Type of Work Work Zone-Related Award Amount Manufacturers and Suppliers Contractors
22
Discussion Tom Reamy tomr@kapsgroup.com KAPS Group
Knowledge Architecture Professional Services
25
SLA: Case Study – Text and Data Rule Development
Development Process: Iterative Start with content – preferably categorized by human Option if existing taxonomy – use for auto-rule Simple lists of terms/phrases – build until maximum recall Refine the rules for precision – important terms, only in targets % Add logic for enhanced accuracy – 80-90% Add new content – accuracy drops, refine until target accuracy Repeat until “done” Developer – text analysts – typical role is librarian, can be power user-subject matter expert with “logical mind”
26
SLA: Case Study – Text and Data Advantages of Rules
Power – type “drainage” get too many and too few Documents that mention drainage but in passing Systemic text – part of standard – example work order categories – documents have list of all of them Documents that use other words – “ponding”, etc. Build templates – build rich text, separate logic from text Utilize semi-structure – parts of documents – as text markers Work Orders – LOCATION AND DESCRIPTION OF PROPOSED WORK Generalize to other types of Work Orders, other DOTs
27
What is a Catonomy
28
What is a Catonomy
32
SLA: Case Study – Text and Data Rule Development – Work Order Category
Programming language: AND, OR, NOT PARAGRAPH EXPRESSION (regular – use for date ranges) Include both text and code: ADD “Additional work not originally planned” / Plus misspellings – Utilitity Both codes and text show up in other places in the documents Use “Category: ADD – removed many false positives, but also missed many – pdf formatting Rule – Category: within 2 paragraphs of ADD, Text – captures all without introducing false positives Rule 2: 4 paragraphs after target text
33
Test and Evaluation Faceted Search Design
Collected metrics comparing “Vanilla” search using FAST with faceted search on auto-classified documents Examples: Recall of Work Orders for UPC (in Top 30 Results) Precision of work orders related to utility issues (for Top 20 Results) Example 1: Recall was of relevant documents in the Top 30 results. 24 of the 26 known relevant documents were found in the faceted search, compared to 4 of 26 for the vanilla search. Precision of this search was also high: 100% of Top 20 results were relevant for the faceted search (20 of 20), while 80% of all results were relevant in the vanilla search (4 of 5 returned results). Example 2: Both the vanilla search and the faceted search returned a relevant document as the first result. Only 13 results were needed to find 10 relevant documents in the faceted search, while at least 30 (and likely many more) would be needed to find 10 relevant documents in the vanilla search.
34
SLA: Case Study – Text and Data Build on Solution
Transferability – major issue with 51 separate organizations Tracked effort level to develop solution for VDOT Initial test – simple substitution of WSDOT content Accuracy was very poor Solution was very simple Simple text substitution – section indicators WSDOT versions – district, some model changes WSDOT – Specific facet values – materials, etc. Estimate of 1/10 effort to adapt to new DOT’s
35
Conclusions Catonomy/Catology Projects – start with deep research
Content, Users, Needs and Behaviors, Technology Text Analytics is a Platform – Search and Applications FOIA Requests – all projects in which a model Guardrail from Supplier Y was installed – when, location (Route 29) Taxonomy + CM + TA + Policy and Process + People Search doesn’t have to suck! Faceted Search Works – But Requires Metadata+ Counting words will never work Utilizing Poly-Structure is solution to unstructured text Text analytics methods – all of the above Machine Learning and Rules Text and Data, Taxonomy and Ontology
36
Additional Information
Conferences: Sentiment Symposium – 6/27-28-New York Taxonomy Boot Camp – 10/17-18-London Internet Librarian – 10/22-25-Monterey Taxonomy Boot Camp – 11/6-7 - DC Text Analytics Forum – 11/8-9 –DC KAPS Group web site – links to resources, partners Best = Buy the book!
37
Questions? :
38
Deep Text Reviews “A treasure trove of technical detail, likely to become a definitive source on text analytics” – Kirkus Reviews “A remarkably authoritative deep-dive into a field that will be brand new to many and eye-opening for all.” – Kirkus Reviews “One of the main strengths of this book, though, is that even when its content is highly technical, it’s so well-organized and tightly written that it’s quite enjoyable to read.” – Kirkus Reviews
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.