Smart Text How to Turn Big Text into Big Data Tom Reamy Chief Knowledge Architect KAPS Group Program Chair – Text Analytics World.

1 Smart Text: How to Turn Big Text into Big Data
Tom Reamy, Chief Knowledge Architect, KAPS Group, http://www.kapsgroup.com
Program Chair – Text Analytics World
Taxonomy Boot Camp, KMWorld: Washington DC
Internet Librarian: Monterey, CA

2 KAPS Group: General
• Knowledge Architecture Professional Services – Network of Consultants
• Partners – Expert System, SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics
• Strategy – IM & KM: Text Analytics, Social Media, Integration
• Services:
  – Taxonomy/Text Analytics development, consulting, customization
  – Text Analytics Fast Start – Audit, Evaluation, Pilot
  – Social Media: Text-based applications – design & development
• Clients:
  – Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, etc.
• Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
• Presentations, Articles, White Papers – http://www.kapsgroup.com

3 Agenda
• Introduction: Big Text and Big Data
• Pharma: Semantic Search Application
  – Project Components & Approach
  – Extraction Rules
• Publishing: Processing 700K Proposals
  – Adding Structure to Unstructured Text
  – Text into Data
• Conclusions

4 Big Text and Big Data
• Big Text is Bigger than Big Data
  – 80% -> 90% of business information (Social Media)
• Big Data tells you WHAT – Smart Text tells you WHY
• Big Data
  – Data Munging = 50-80% of Data Scientist Time
  – Variety of Formats // Ambiguity of Human Language
• Ontology / Fact Extraction
  – Pulmonary ISA Disease
  – Chronic obstructive pulmonary disease, obstructive pulmonary disease, Copd, copd, COPD, Asthma (Asthema), Emphysema, etc.
• Semi-Automatic Hybrid Solutions
  – AI not here yet (again)
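The ambiguity problem on this slide (many surface forms, one concept) can be sketched as a synonym table plus an ISA lookup. This is a minimal illustration, not the tooling used in the projects; the table contents, the reading of "Pulmonary ISA Disease" as a two-level hierarchy, and the names `SYNONYMS`, `ISA`, `extract_concepts`, and `isa_chain` are all assumptions.

```python
import re

# Hypothetical synonym table: surface form (lowercase) -> canonical concept.
SYNONYMS = {
    "chronic obstructive pulmonary disease": "COPD",
    "obstructive pulmonary disease": "COPD",
    "copd": "COPD",
    "asthma": "Asthma",
    "asthema": "Asthma",  # misspelling listed on the slide
    "emphysema": "Emphysema",
}

# Hypothetical ISA facts: child -> parent (assumed reading of the slide).
ISA = {
    "COPD": "Pulmonary",
    "Asthma": "Pulmonary",
    "Emphysema": "Pulmonary",
    "Pulmonary": "Disease",
}

# Longest variants first so a full phrase wins over a phrase it contains.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_concepts(text):
    """Return canonical concepts in order of first mention, without repeats."""
    found = []
    for match in _PATTERN.finditer(text):
        concept = SYNONYMS[match.group(1).lower()]
        if concept not in found:
            found.append(concept)
    return found


def isa_chain(concept):
    """Walk the ISA table upward and return the list of ancestors."""
    chain = []
    while concept in ISA:
        concept = ISA[concept]
        chain.append(concept)
    return chain
```

With this table, "Copd", "COPD", and the spelled-out phrase all collapse to one concept, which is the step that turns free text into a countable data value.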

5 Pharma: Project
• Agile Methodology
• Goal – evaluate text analysis technologies' ability to:
  – Replace manual annotation of scientific documents (automated or semi-automated)
  – Discover new entities and relationships
  – Provide users with self-service capabilities
• Goal – feasibility and effort level

6 Components – Technology, Resources
• Cambridge Semantics, Linguamatics, SAS Enterprise Content Categorization
  – Initial integration – passing results as XML
• Content – scientific journal articles
• Taxonomy – MeSH – select small subset
• Access to a "customer" – critical for success

7 Three Rounds – Iterations
• Visualization – faceted search, sort by date, author, journal – Cambridge Semantics
• Round 1 – PDF from their database
  – Needed to create additional structure and metadata
  – No such thing as unstructured content
• Round 2 & 3 – XML with full metadata from PubMed
• Entity Recognition – Species, Document Type, Study Type, Drug Names, Disease Names, Adverse Events

8 Components & Approach
• Rules or sample documents?
  – Need more precision and granularity than documents can do
  – Training sets – not as easy as thought
• First Rules – text indicators to define sections of the document
  – Objectives, Abstract, Purpose, Aim – all the "same" section
  – Experiment – clusters / vocabulary to define section
• Separate logic of the rules from the text
  – Stable rules, changing text
• Scores – relevancy with thresholds
  – Not just frequency of words
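The "separate logic from text" idea can be sketched as a vocabulary table that changes often and a small function that does not. This is an illustration under assumptions: the heading lists and the names `SECTION_VOCAB` and `split_sections` are hypothetical, and headings are assumed to sit alone on their own line.

```python
# Vocabulary (changing text): heading variants that map to one logical section.
# The lists are hypothetical; the slide names Objectives, Abstract, Purpose, Aim.
SECTION_VOCAB = {
    "objective": ["objective", "objectives", "abstract", "purpose", "aim", "aims"],
    "methods": ["methods", "materials and methods"],
    "results": ["results", "findings"],
}


def build_heading_index(vocab):
    """Invert the vocabulary: heading variant -> logical section name."""
    return {
        variant: section
        for section, variants in vocab.items()
        for variant in variants
    }


def split_sections(text, vocab=SECTION_VOCAB):
    """Logic (stable rule): a line that is only a known heading opens a section."""
    index = build_heading_index(vocab)
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.rstrip(":").lower()
        if key in index:
            current = index[key]
            sections.setdefault(current, [])
        elif current is not None and stripped:
            sections[current].append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}
```

Adding a new heading variant means editing `SECTION_VOCAB` only; `split_sections` stays untouched.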

9 Document Type Rules
  (START_2000, (AND, (OR, _/article:"[Abstract]", _/article:"[Methods]", _/article:"[Objective]",
  _/article:"[Results]", _/article:"[Discussion]", (OR,
  _/article:"clinical trial*", _/article:"humans",
  (NOT, (DIST_5, (OR,_/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"),
• Clinical Trial Rule:
  – If the article has sections like Abstract or Methods
  – AND has phrases around "clinical trials / humans", and NOT words like "animals" within 5 words of the "clinical trial" words: count it and add up a relevancy score
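The plain-language version of the clinical trial rule can be approximated in a few lines. This is a rough sketch, not the SAS rule syntax: it assumes START_2000 means "look only at the first 2,000 words" (by analogy with the START_500 explanation on slide 12), treats DIST_5 as a window of 5 tokens on either side, and uses the hypothetical function name `clinical_trial_score`.

```python
import re

SECTION_MARKERS = {"abstract", "methods", "objective", "results", "discussion"}
EXCLUSION_TERMS = {"approved", "safe", "use", "animals"}
WINDOW = 5          # assumed meaning of DIST_5: within 5 tokens
START_LIMIT = 2000  # assumed meaning of START_2000: first 2,000 words


def tokenize(text):
    """Lowercase alphabetic tokens; numbers and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def clinical_trial_score(text):
    """Count trial/human mentions that have no exclusion term nearby."""
    tokens = tokenize(text)[:START_LIMIT]
    # The rule only fires for documents that have recognizable sections.
    if not SECTION_MARKERS & set(tokens):
        return 0
    score = 0
    for i, token in enumerate(tokens):
        is_trial = (
            token == "clinical"
            and i + 1 < len(tokens)
            and tokens[i + 1].startswith("trial")
        )
        if not (is_trial or token == "humans"):
            continue
        nearby = tokens[max(0, i - WINDOW): i + WINDOW + 1]
        if EXCLUSION_TERMS & set(nearby):
            continue  # e.g. "clinical trial in animals"
        score += 1
    return score
```

The returned count would then be compared against a threshold, in line with the "relevancy with thresholds" point on slide 8.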

10 Rules for Drug Names and Diseases
• Primary issue – major mentions, not every mention
  – Combination of noun phrase extraction and categorization
  – Results – virtually 100%
• Taxonomy of drug names and diseases
• Capture general diseases like thrombosis and specific types like deep vein, cerebral, and cardiac
• Combine text about arthritis and synonyms with text like "Journal of Rheumatology"

12 Rules for Drug Names and Diseases
  (OR, _/article/title:"[clonidine]",
  (AND, _/article/mesh:"[clonidine]",_/article/abstract:"[clonidine]"),
  (MINOC_2, _/article/abstract:"[clonidine]")
  (START_500, (MINOC_2,"[clonidine]")))
• Means – any variation of drug name in title – high score
• Any variation in MeSH keywords AND in abstract – high score
• Any variation in abstract at least 2x – good score
• Any variation in first 500 words at least 2x – suspect
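The four readings of the clonidine rule map naturally onto an ordered set of checks. The sketch below follows the slide's explanation rather than the rule engine itself; the article layout (a dict with `title`, `mesh`, `abstract`, `body`), the variant list, and the names `count_mentions` and `mention_level` are assumptions for illustration.

```python
import re


def count_mentions(text, variants):
    """Count case-insensitive whole-word occurrences of any variant."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return len(pattern.findall(text))


def mention_level(article, variants):
    """Classify how prominently a drug is mentioned: high, good, suspect, none."""
    in_title = count_mentions(article.get("title", ""), variants)
    in_mesh = count_mentions(" ".join(article.get("mesh", [])), variants)
    in_abstract = count_mentions(article.get("abstract", ""), variants)
    first_500 = " ".join(article.get("body", "").split()[:500])

    if in_title >= 1:
        return "high"     # any variation in the title
    if in_mesh >= 1 and in_abstract >= 1:
        return "high"     # MeSH keywords AND abstract
    if in_abstract >= 2:
        return "good"     # at least twice in the abstract (MINOC_2)
    if count_mentions(first_500, variants) >= 2:
        return "suspect"  # at least twice in the first 500 words
    return "none"


# Illustrative variant list; a real taxonomy would supply these.
CLONIDINE = ["clonidine", "Catapres"]
```

Ordering the checks from strongest to weakest evidence is what separates a major mention from a passing one.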

13 Rules for Drug Names and Diseases
• Results:
  – Wide range by type: 70-100% recall and precision
• Focus mostly on precision – difficult to test recall
• One deep dive area indicated that 90%+ scores for both precision and recall could be built with moderate level of effort
• Not linear effort – 30% accuracy does not mean 1/3 done
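For reference, precision and recall as used above can be computed from two sets: what the rules tagged and what a human reviewer says should have been tagged. The function name and the sample document IDs are hypothetical.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predictions against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Rules tagged four documents; reviewers say three documents are relevant.
tagged = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc1", "doc2", "doc5"}
precision, recall = precision_recall(tagged, relevant)
```

Precision only needs a review of what the rules returned, while recall needs the complete gold set, including documents the rules never surfaced. That is one plausible reason recall is the harder number to test.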

14 Conclusion
• Project was a success!
• Useful results – as defined by the customer
• Reasonable and doable effort level – both for initial development and maintenance
• Essential Success Factors
  – Rules not documents, training sets (starting point)
  – Full platform for disambiguation of noun phrase extraction, major-minor mention
  – Separation of logic and text
• "Semantic" Search works!
  – If you do it smart!

15 Publishing Project: Reed Construction Data
• 700,000 Proposals – Wide Variation
• Process Proposals – extract data – 30-50 types
• Current Manual Process – Internal Teams – Expensive and Slow
• Structure Variety of Unstructured Documents
  – Generate Table of Contents
  – Generate Sections and Capture Text
• Extract Key Information
• Save Time & Money, Flexible Hiring, New Offerings

16 Publishing Project: Components: Technology, Resources
• Initial Attempt – failed target, too expensive to complete
• KAPS Group and SAS – Enterprise Content Categorization
  – Team of 4 – mostly part time
• Reed Data Resources – 3 part time +, current team of proposal processors – develop test documents
• 4 Months – majority of time/effort on Key Data Extraction
• Sections – by Construction codes & text, Automated Table of Contents

17 Publishing Project: Example Rules – Automated Table of Contents

18 Publishing Project: Example Rules – Automated Table of Contents
  ( AND, (OR,
  (ORD,"[SectionHeaderTags]","[Division01B_RegEx]","[TechnicalSpecPhrases]",
  (ORDDIST_3,"[SectionBodyPart]","[SectionBodyDesc]"
  )),
  (ORD,"[Division01B_RegEx]","[TechnicalSpecPhrases]",
  (ORDDIST_3,"[SectionBodyPart]","[SectionBodyDesc]"
  ))))
• __Division01BRegEx
  00[0-9][0-9][0-9],
  00[ _-]?[0-9][0-9][ _-]?[0-9][0-9],
  00[ _-]?[0-9][0-9][ _-]?[0-9][0-9][\.][0-9][0-9],
• Abandonment, Abatement, Abbreviations, Above-Grade, Aboveground, Abrasion-Resistant, Abrasive, Absorption, AC, Acceleration, etc. (~2,000 terms)
• Section Header Tags – "Section, Division, Document"
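The three Division 01B patterns translate directly into one Python regular expression, which is enough to sketch the first part of the table-of-contents rule: a header tag followed, in order, by a section code. The technical-phrase and body conditions of the full rule are left out, and the function name `find_section_headers` and the sample lines are made up for illustration.

```python
import re

# The three slide patterns, longest alternative first so that
# "00 52 13.01" is not cut short at "00 52 13".
DIVISION_01B = re.compile(
    r"00[ _-]?[0-9]{2}[ _-]?[0-9]{2}\.[0-9]{2}"
    r"|00[ _-]?[0-9]{2}[ _-]?[0-9]{2}"
    r"|00[0-9]{3}"
)

# Header tags named on the slide.
SECTION_HEADER_TAGS = re.compile(r"\b(section|division|document)\b", re.IGNORECASE)


def find_section_headers(lines):
    """Return (line_number, section_code) for lines with a tag, then a code."""
    headers = []
    for number, line in enumerate(lines, start=1):
        tag = SECTION_HEADER_TAGS.search(line)
        if tag is None:
            continue
        # Search only after the tag to mimic the ordered (ORD) match.
        code = DIVISION_01B.search(line, tag.end())
        if code is not None:
            headers.append((number, code.group(0)))
    return headers


sample = [
    "SECTION 00 41 13 - BID FORM",
    "Bidders shall submit 00100 copies",
    "Document 00-52-13.01 Agreement",
    "DIVISION 00500",
]
```

The second sample line contains a matching number but no header tag, so it is skipped; that is the kind of false positive the combined rule is meant to avoid.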

19 Publishing Project: Example Rules – Key Data Extraction
• Bid Dates/Times
• Roles (Architect, Designer, etc.) – names and addresses, etc.
• Project Attributes – Cost, Invitation Number, Parking, etc.
• Some Easy, Some Hard – Address!
• Example
  ARCHITECT:
  MICHEAL KIM ARCHITECTURE
  1 HOLDEN STREET
  BROOKLINE, MA 02445
  P: (617) 739-6925
  F: (772) 325-2991
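A block laid out as cleanly as the example above can be parsed with a handful of patterns. The sketch below handles only that layout; the role list, field names, and the function name `parse_role_block` are assumptions, and real proposals vary far more than this allows for.

```python
import re

# Hypothetical role list; the slide names Architect and Designer.
ROLE_LINE = re.compile(r"^(ARCHITECT|DESIGNER|ENGINEER|OWNER):\s*$", re.IGNORECASE)
CITY_STATE_ZIP = re.compile(
    r"^(?P<city>[A-Za-z .'-]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)
PHONE_LINE = re.compile(r"^(?P<kind>[PF]):\s*(?P<number>\(\d{3}\)\s*\d{3}-\d{4})$")


def parse_role_block(block):
    """Turn a role block (role, name, address, phone lines) into a flat dict."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    if not lines:
        return {}
    role = ROLE_LINE.match(lines[0])
    if role is None:
        return {}
    record = {"role": role.group(1).title()}
    if len(lines) > 1:
        record["name"] = lines[1]  # assumes the name follows the role line
    for line in lines[2:]:
        place = CITY_STATE_ZIP.match(line)
        phone = PHONE_LINE.match(line)
        if place:
            record.update(place.groupdict())
        elif phone:
            key = "phone" if phone.group("kind") == "P" else "fax"
            record[key] = phone.group("number")
        else:
            record.setdefault("street", line)  # first unmatched line
    return record


example = """
ARCHITECT:
MICHEAL KIM ARCHITECTURE
1 HOLDEN STREET
BROOKLINE, MA 02445
P: (617) 739-6925
F: (772) 325-2991
"""
```

Addresses are the "hard" case because suite numbers, multi-line streets, and missing ZIP codes break positional assumptions like the ones made here.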

20 Publishing Project: Process & Approach

21 Publishing Project: Example Rules – Key Project Data

22 Publishing Project: Example Rules – Key Project Data

23 Conclusion: Lessons Learned
• Development requires lots of content, testers, regular meetings
• Best Pattern Rule Development = develop a few rules to production level, then adapt to other areas
• Hybrid Solutions are best (AI not here yet)
• Biggest Problem = Human Creativity
• Best Solution = Human Creativity
• But – successful project!
• Foundation laid for semi-automated text processing, new data
• Next Steps – refine, add, refine, new, refine, refine

24 Summary
• Text Analytics: Platform & Foundation for Applications
• Semantic Search and (Semi)-Automated Business Processes
• AND – Sentiment Analysis/Social Media, Fraud Detection, eDiscovery, Expertise location & analysis, behavior prediction
• Data/Fact Extraction can feed/extend Big Data and Semantic Technology applications
• Interested? – Text Analytics World, San Francisco, March 30-April 1 (Call for Speakers Now) – textanalyticsworld.com
• New Book coming: Text Analytics: Everything You Need to Know to Conquer Information Overload, Mine Social Media for Real Value, and Turn Big Text into Big Data

25 Questions?
Tom Reamy, tomr@kapsgroup.com
KAPS Group: Knowledge Architecture Professional Services
http://www.kapsgroup.com
www.TextAnalyticsWorld.com
March 30-April 1, San Francisco

