Smart Text How to Turn Big Text into Big Data Tom Reamy Chief Knowledge Architect KAPS Group Program Chair – Text Analytics World.

Presentation transcript:

Smart Text How to Turn Big Text into Big Data Tom Reamy Chief Knowledge Architect KAPS Group Program Chair – Text Analytics World Taxonomy Boot Camp, KMWorld: Washington DC Internet Librarian: Monterey, CA

2 KAPS Group: General
Knowledge Architecture Professional Services – Network of Consultants
Partners – Expert System, SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics
Strategy – IM & KM - Text Analytics, Social Media, Integration
Services:
  – Taxonomy/Text Analytics development, consulting, customization
  – Text Analytics Fast Start – Audit, Evaluation, Pilot
  – Social Media: Text based applications – design & development
Clients:
  – Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, etc.
Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
Presentations, Articles, White Papers –

3 Agenda
Introduction: Big Text and Big Data
Pharma: Semantic Search Application
  – Project Components & Approach
  – Extraction Rules
Publishing: Processing 700K Proposals
  – Adding Structure to Unstructured Text
  – Text into Data
Conclusions

4 Big Text and Big Data
Big Text is Bigger than Big Data
  – 80% -> 90% of business information (Social Media)
Big Data tells you WHAT – Smart Text tells you WHY
Big Data
  – Data Munging = 50-80% of Data Scientist Time
  – Variety of Formats // Ambiguity of Human Language
Ontology / Fact Extraction
  – Pulmonary ISA Disease
  – Chronic obstructive pulmonary disease, obstructive pulmonary disease, Copd, copd, COPD, Asthma (Asthema), Emphysema, etc.
Semi-Automatic Hybrid Solutions
  – AI not here yet (again)
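
The fact-extraction point above (many surface forms, one concept) can be sketched in a few lines of Python. This is only an illustration, not the method used in the projects described here: the `VARIANTS` table and the `extract_concepts` name are assumptions made up for the example, and a real ontology would hold far more forms.

```python
import re

# Hypothetical variant table: surface form (lowercase) -> canonical concept.
VARIANTS = {
    "chronic obstructive pulmonary disease": "COPD",
    "obstructive pulmonary disease": "COPD",
    "copd": "COPD",
    "asthma": "Asthma",
    "asthema": "Asthma",  # misspelling listed on the slide
    "emphysema": "Emphysema",
}

# Longest variants first so a full phrase wins over a phrase it contains.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(VARIANTS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_concepts(text):
    """Return canonical concept names in order of first mention."""
    seen = []
    for match in _PATTERN.finditer(text):
        concept = VARIANTS[match.group(1).lower()]
        if concept not in seen:
            seen.append(concept)
    return seen
```

Calling `extract_concepts` on a sentence that mixes "Copd", the spelled-out name, and "asthema" yields one entry per concept rather than one per spelling, which is the step that turns text into countable data.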

5 Pharma: Project
Agile Methodology
Goal – evaluate text analysis technologies' ability to:
  – Replace manual annotation of scientific documents – automated or semi-automated
  – Discover new entities and relationships
  – Provide users with self-service capabilities
Goal – feasibility and effort level

6 Components – Technology, Resources
Cambridge Semantics, Linguamatics, SAS Enterprise Content Categorization
  – Initial integration – passing results as XML
Content – scientific journal articles
Taxonomy – MeSH – select small subset
Access to a “customer” – critical for success

7 Three rounds - Iterations
Visualization – faceted search, sort by date, author, journal – Cambridge Semantics
Round 1 – PDF from their database
  – Needed to create additional structure and metadata
  – No such thing as unstructured content
Round 2 & 3 – XML with full metadata from PubMed
Entity Recognition – Species, Document Type, Study Type, Drug Names, Disease Names, Adverse Events

8 Components & Approach
Rules or sample documents?
  – Need more precision and granularity than documents can do
  – Training sets – not as easy as thought
First Rules – text indicators to define sections of the document
  – Objectives, Abstract, Purpose, Aim – all the “same” section
  – Experiment – clusters / vocabulary to define section
Separate logic of the rules from the text
  – Stable rules, changing text
Scores – relevancy with thresholds
  – Not just frequency of words
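
A minimal Python sketch of the two ideas above: header words that signal the "same" section, and keeping the rule logic separate from the vocabulary it uses. The indicator lists and function names are hypothetical, and the sketch assumes each section header sits alone on its own line, which real articles often violate.

```python
import re

# The "text": vocabulary that can change without touching the rule logic.
# Hypothetical lists for illustration only.
SECTION_INDICATORS = {
    "objective": ["objective", "objectives", "abstract", "purpose", "aim", "aims"],
    "methods": ["methods", "materials and methods"],
    "results": ["results"],
}


def build_header_pattern(indicators):
    """Compile one pattern for all header words and a word -> section lookup."""
    lookup = {}
    for section, words in indicators.items():
        for word in words:
            lookup[word] = section
    alternation = "|".join(
        re.escape(w) for w in sorted(lookup, key=len, reverse=True)
    )
    pattern = re.compile(
        r"^[ \t]*(" + alternation + r")[ \t]*:?[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    return pattern, lookup


def split_sections(text, indicators=SECTION_INDICATORS):
    """The "logic": cut the text at header lines, merging equivalent headers."""
    pattern, lookup = build_header_pattern(indicators)
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        name = lookup[match.group(1).lower()]
        body = text[match.end():end].strip()
        sections[name] = (sections.get(name, "") + "\n" + body).strip()
    return sections
```

Adding a new header synonym means editing `SECTION_INDICATORS` only; `split_sections` stays untouched, which is the "stable rules, changing text" split in miniature.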

9 Document Type Rules
(START_2000, (AND, (OR, _/article:"[Abstract]", _/article:"[Methods]", _/article:"[Objective]",
_/article:"[Results]", _/article:"[Discussion]", (OR,
_/article:"clinical trial*", _/article:"humans",
(NOT, (DIST_5, (OR,_/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"),
Clinical Trial Rule:
If the article has sections like Abstract or Methods
AND has phrases around “clinical trials / Humans” and not words like “animals” within 5 words of “clinical trial” words – count it and add up a relevancy score
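
The plain-language reading of the clinical trial rule can be approximated in Python. This is a rough sketch, not a port of the rule: the tokenizer, the function name, and the choice to add one point per qualifying mention are assumptions, and the rule as shown on the slide is incomplete, so its exact nesting is not reproduced.

```python
import re

SECTION_WORDS = {"abstract", "methods", "objective", "results", "discussion"}
EXCLUDED_WORDS = {"approved", "safe", "use", "animals"}


def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately crude assumption."""
    return re.findall(r"[a-z]+", text.lower())


def clinical_trial_score(text, window=2000, distance=5):
    """Count trial/human mentions that have no excluded word nearby."""
    words = tokenize(text)[:window]  # like START_2000: first 2000 words only
    if not SECTION_WORDS.intersection(words):
        return 0  # no section-like words, so the rule does not fire
    score = 0
    for i, word in enumerate(words):
        is_trial = (
            word == "clinical"
            and i + 1 < len(words)
            and words[i + 1].startswith("trial")
        )
        if not (is_trial or word == "humans"):
            continue
        nearby = words[max(0, i - distance): i + distance + 1]
        if EXCLUDED_WORDS.isdisjoint(nearby):  # like NOT + DIST_5
            score += 1
    return score
```

A threshold on the returned score would then decide whether the article is tagged as a clinical trial, in line with the "relevancy with thresholds" approach described earlier.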

10 Rules for Drug Names and Diseases
Primary issue – major mentions, not every mention
  – Combination of noun phrase extraction and categorization
  – Results – virtually 100%
Taxonomy of drug names and diseases
Capture general diseases like thrombosis and specific types like deep vein, cerebral, and cardiac
Combine text about arthritis and synonyms with text like “Journal of Rheumatology”

12 Rules for Drug Names and Diseases
(OR, _/article/title:"[clonidine]",
(AND, _/article/mesh:"[clonidine]",_/article/abstract:"[clonidine]"),
(MINOC_2, _/article/abstract:"[clonidine]"),
(START_500, (MINOC_2,"[clonidine]")))
Means – any variation of drug name in title – high score
Any variation in MeSH Keywords AND in abstract – high score
Any variation in Abstract at least 2x – good score
Any variation in first 500 words at least 2x – suspect
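
The four tiers of the drug-name rule can be sketched in Python as an ordered set of checks. The article is assumed to be a dictionary with `title`, `mesh`, `abstract`, and `body` keys, and the variant list is supplied by the caller; both are assumptions for illustration, since the original rule reads these fields from XML and takes its variants from a taxonomy.

```python
import re


def count_mentions(text, variants):
    """Count whole-word, case-insensitive mentions of any variant."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return len(pattern.findall(text))


def first_words(text, n):
    """Return the first n whitespace-separated words as one string."""
    return " ".join(text.split()[:n])


def drug_mention_level(article, variants):
    """Return 'high', 'good', 'suspect', or None for a drug in an article."""
    in_title = count_mentions(article["title"], variants)
    in_mesh = count_mentions(article["mesh"], variants)
    in_abstract = count_mentions(article["abstract"], variants)
    if in_title >= 1:
        return "high"  # any variation in the title
    if in_mesh >= 1 and in_abstract >= 1:
        return "high"  # MeSH keywords AND abstract
    if in_abstract >= 2:
        return "good"  # at least twice in the abstract (MINOC_2)
    if count_mentions(first_words(article["body"], 500), variants) >= 2:
        return "suspect"  # at least twice in the first 500 words
    return None
```

Ordering the checks from strongest to weakest evidence is one way to separate major mentions from passing ones; a numeric score per tier would work equally well.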

13 Rules for Drug Names and Diseases
Results:
  – Wide Range by type % recall and precision
Focus mostly on precision – difficult to test recall
One deep dive area indicated that 90%+ scores for both precision and recall could be built with moderate level of effort
Not linear effort – 30% accuracy does not mean 1/3 done
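
For reference, precision and recall as used above can be computed from a set of extracted entities and a hand-annotated gold set. The function name and the sample entities in the usage note are made up for illustration; they are not results from the project.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted vs. gold annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Precision only needs a review of what the rules returned, while recall needs a gold set covering everything they should have returned, which is why recall is the harder of the two to test.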

14 Conclusion
Project was a success!
Useful results – as defined by the customer
Reasonable and doable effort level – both for initial development and maintenance
Essential Success Factors
  – Rules not documents, training sets (starting point)
  – Full platform for disambiguation of noun phrase extraction, major-minor mention
  – Separation of logic and text
“Semantic” Search works!
  – If you do it smart!

15 Publishing Project: Reed Construction Data
700,000 Proposals – Wide Variation
Process Proposals – extract data – types
Current Manual Process – Internal Teams – Expensive and Slow
Structure Variety of Unstructured Documents
  – Generate Table of Contents
  – Generate Sections and Capture Text
Extract Key Information
Save Time & Money, Flexible Hiring, New Offerings

16 Publishing Project: Components: Technology, Resources
Initial Attempt – failed target, too expensive to complete
KAPS Group and SAS – Enterprise Content Categorization
  – Team of 4 – mostly part time
Reed Data Resources – 3 part time +, Current team of proposal processors – develop test documents
4 Months – majority of time/effort on Key Data Extraction
Sections – by Construction codes & text, Automated Table of Contents

17 Publishing Project: Example Rules Automated Table of Contents

18 Publishing Project: Example Rules Automated Table of Contents
( AND, (OR,
(ORD,"[SectionHeaderTags]","[Division01B_RegEx]","[TechnicalSpecPhrases]",
(ORDDIST_3,"[SectionBodyPart]","[SectionBodyDesc]"
)),
(ORD,"[Division01B_RegEx]","[TechnicalSpecPhrases]",
(ORDDIST_3,"[SectionBodyPart]","[SectionBodyDesc]"
))))
__Division01BRegEx
00[0-9][0-9][0-9],
00[ _-]?[0-9][0-9][ _-]?[0-9][0-9],
00[ _-]?[0-9][0-9][ _-]?[0-9][0-9][\.][0-9][0-9],
Abandonment, Abatement, Abbreviations, Above-Grade, Aboveground, Abrasion-Resistant,
Abrasive, Absorption, AC, Acceleration, etc - ~2,000 terms
Section Header Tags – “Section, Division, Document”
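
The three division-number patterns above translate directly into Python's `re` syntax. The sketch below only finds the section numbers; it does not reproduce the ordered (ORD) and distance (ORDDIST_3) logic of the full rule. The function name and the digit guards around the match are assumptions added for the example.

```python
import re

# The three patterns from the slide, most specific first so the longest
# form (with the ".NN" suffix) is tried before the shorter ones.
DIVISION_PATTERNS = [
    r"00[ _-]?[0-9][0-9][ _-]?[0-9][0-9][\.][0-9][0-9]",
    r"00[ _-]?[0-9][0-9][ _-]?[0-9][0-9]",
    r"00[0-9][0-9][0-9]",
]

# Guards (an assumption) keep a match from starting or ending inside a
# longer run of digits.
_SECTION_NUMBER = re.compile(
    r"(?<![0-9])(?:" + "|".join(DIVISION_PATTERNS) + r")(?![0-9])"
)


def find_section_numbers(text):
    """Return division-00 style section numbers in document order."""
    return _SECTION_NUMBER.findall(text)
```

In a table-of-contents generator, each hit would then be checked against nearby header tags and technical phrases before being accepted as a section start.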

19 Publishing Project: Example Rules Key Data Extraction
Bid Dates/Times
Roles (Architect, Designer, etc.) – names and addresses, etc.
Project Attributes – Cost, Invitation Number, Parking, etc.
Some Easy, Some Hard – Address!
Example
ARCHITECT:
MICHEAL KIM ARCHITECTURE
1 HOLDEN STREET
BROOKLINE, MA
P: (617)
F: (772)
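
A small Python sketch of pulling a role block like the ARCHITECT example into fields. It assumes the tidy layout shown above: a label on its own line, name and address lines beneath it, `P:` and `F:` prefixes for phone and fax, and a blank line ending the block. The label list and function name are hypothetical, and real proposals vary far more than this, which is why addresses are flagged as hard.

```python
import re

# Hypothetical set of role labels; a real project would maintain a longer list.
ROLE_LABELS = ("ARCHITECT", "DESIGNER", "ENGINEER", "OWNER")

_LABEL = re.compile(r"^\s*(" + "|".join(ROLE_LABELS) + r")\s*:\s*$", re.IGNORECASE)
_CONTACT = re.compile(r"^\s*([PF])\s*:\s*(.*)$", re.IGNORECASE)


def parse_role_blocks(text):
    """Collect the lines under each 'ROLE:' label until a blank line."""
    blocks = []
    current = None
    for line in text.splitlines():
        label = _LABEL.match(line)
        if label:
            current = {"role": label.group(1).upper(), "lines": [],
                       "phone": None, "fax": None}
            blocks.append(current)
        elif current is not None:
            if not line.strip():
                current = None  # blank line closes the block
                continue
            contact = _CONTACT.match(line)
            if contact:
                key = "phone" if contact.group(1).upper() == "P" else "fax"
                current[key] = contact.group(2).strip()
            else:
                current["lines"].append(line.strip())
    return blocks
```

The first entry in `lines` is usually the firm name and the rest the address, but that split is left to the caller because the number of address lines is not fixed.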

20 Publishing Project: Process & Approach

21 Publishing Project: Example Rules Key Project Data

22 Publishing Project: Example Rules Key Project Data

23 Conclusion: Lessons Learned
Development requires lots of content, testers, regular meetings
Best Pattern Rule Development = develop a few rules to production level, then adapt to other areas
Hybrid Solutions are best (AI not here yet)
Biggest Problem = Human Creativity
Best Solution = Human Creativity
But – successful project!
Foundation laid for Semi-automated text processing, new data
Next Steps – refine, add, refine, new, refine, refine

24 Summary
Text Analytics: Platform & Foundation for Applications
Semantic Search and (Semi)-Automated Business Processes
AND – Sentiment Analysis-Social Media, Fraud Detection, eDiscovery, Expertise location & analysis, behavior prediction
Data/Fact Extraction can feed/extend Big Data and Semantic Technology applications
Interested? – Text Analytics World, San Francisco March 30-April 1 (Call for Speakers Now) - textanalyticsworld.com
New Book coming: Text Analytics: Everything You Need to Know to Conquer Information Overload, Mine Social Media for Real Value, and Turn Big Text into Big Data

Questions? Tom Reamy KAPS Group Knowledge Architecture Professional Services March 30-April 1, San Francisco