A journey into Text Analytics John McConnell Analytical People ASC Winchester 7th September 2013 © analytical-people 2013.



Contents
Background & Objectives
Our current view on Text Analytics
  – Value
  – Process
An example application
Conclusions

Background
Text Analytics and Text Mining are largely synonymous
Interest in, and use of, Text Analytics is growing
  – Social Media sources are largely responsible for this
  – And that often means “Big Data”
This should lead to further improvements in technology and methodology, which will benefit survey practitioners

Objectives
We’ve been involved in more Text Analytics work in the last 2 years than in all previous years
Our objective in this presentation is to share some of our experience and thoughts around some of the technology we have used

The Value Propositions
1. Reduce cost (and time)
2. Generate actionable insights
  – Improve public and commercial processes

Using Text Analytics to find Text Analytics software

3 Software tools
R
  – Open Source Statistical Platform
  – Command driven
Rapid Miner
  – Open Source Data Mining Workbench
  – GUI
  – Integrates with R and Weka
SPSS Text Analytics for Surveys
  – Commercial Text Analytics
  – GUI

The Process – Highest Level
Unstructured data → Structured data

Process – Level 2
1. Extract
2. Refine
3. Analyse

How can we tell if we are using the right tool(s)?
Extract
  – How good is the first extraction?
  – How long does it take to get to an acceptable extraction?
Refine
  – How easy is it to refine?
  – How easy is it to capture refinements so they can be re-used in future?
Analyse
  – What tools exist to support the Text Analytics process?
  – What tools exist to use the Structured Text in other analyses?
How well do the tools/methods deliver on the value propositions?

Algorithms and Dictionaries (Extract)
Algorithms
  – e.g. Natural Language Processing (NLP)
Dictionaries
  – Variously called Lexicons, Resources, Libraries, etc.
  – Are usually contextual, e.g. Customer Satisfaction

Example Data
The American Physical Society (APS) Student Survey
Comments from 2009 (Base=1304)
Q4.2 Comments about the best features of, and what could be added or improved to, the special programs for Student Members

The first extraction with R
    # Load the tm text mining package from the user library
    library("tm", lib.loc = "C:/Users/jmcconnell/Documents/R/win-library/3.0")
    # Read the survey verbatims
    APS2009df <- read.csv("C:/AP/ASC/APS/APS2009Verbatims.csv", header = TRUE)
    # Build the corpus; note that VectorSource() on a data frame gives one document per column
    text_corpus <- Corpus(VectorSource(APS2009df), readerControl = list(language = "en"))
    summary(text_corpus)  # check what went in
    # Basic clean-up steps (simple NLP)
    text_corpus <- tm_map(text_corpus, removeNumbers)
    text_corpus <- tm_map(text_corpus, removePunctuation)
    text_corpus <- tm_map(text_corpus, stripWhitespace)
    # tm 0.6 and later need the content_transformer() wrapper for base functions
    text_corpus <- tm_map(text_corpus, content_transformer(tolower))
We apply a basic set of text handling methods (simple NLP), e.g. removePunctuation
We also apply a small dictionary of known “Stopwords” (not shown)
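The clean-up and stopword steps described on this slide, followed by a term count, can be sketched in base R without the tm package. This is only an illustration: the comments and the stopword list below are invented and are not the APS verbatims or the dictionary used in the presentation.

```r
# Hypothetical comments standing in for the survey verbatims.
comments <- c("The job fair was great!",
              "More travel grants, please.",
              "Great job fair; add 2 more sessions")

# Illustrative stopword list (an assumption, not the one used in the talk).
stopwords_small <- c("the", "was", "more", "add", "a", "and")

# Lowercase, then replace digits and punctuation with spaces and tidy whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:]]+", " ", x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Split into tokens, drop empty strings and stopwords, then count terms.
tokens <- unlist(strsplit(clean_text(comments), " ", fixed = TRUE))
tokens <- tokens[nzchar(tokens) & !(tokens %in% stopwords_small)]
term_counts <- sort(table(tokens), decreasing = TRUE)

# Show up to the 20 most frequent terms.
print(head(term_counts, 20))
```

With real data the same counts would normally come from a term-document matrix; the base R version is used here only to keep the sketch free of package dependencies.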

R Extraction Results – Top 20 Terms

The first extraction with Rapid Miner
We visually construct a similar set of steps

Rapid Miner Extraction Results – Top 20 Terms

Improving and creating new data (Refine)
Improve the extraction
  – Correct mistakes
  – Add omissions
Map the extraction to structured data
  – Group and combine meaningful terms that will become data for further analysis
In second and subsequent waves (where applicable)
  – Refine should be a shorter step where we look for new concepts

Rapid Miner - Refine
We add one process step to fix up some of the issues in the first extraction
Filter Tokens sets a lower limit for the length of an extracted term/attribute
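The effect of a minimum-length token filter can be sketched in base R. This is not the Rapid Miner operator itself; the function name, the example tokens and the threshold of 3 characters are all assumptions for illustration, since the slide does not state the limit that was used.

```r
# Keep only tokens with at least `min_chars` characters (hypothetical helper).
filter_tokens_by_length <- function(tokens, min_chars = 3) {
  tokens[nchar(tokens) >= min_chars]
}

# Invented tokens, including the short fragments a first extraction often produces.
tokens <- c("a", "job", "fair", "is", "apsa", "i", "to")
kept <- filter_tokens_by_length(tokens, min_chars = 3)
print(kept)
```

A low threshold removes stray letters and fragments, but it can also drop short meaningful terms, so the limit is worth checking against the data.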

Rapid Miner results after first refinement

The first extraction with SPSS
SPSS uses a wizard to specify the extraction steps

SPSS Extraction Results – Top 20 Terms
Synonyms are used from the dictionaries
SPSS is counting respondents, not occurrences
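The difference between counting occurrences and counting respondents matters when comparing top-term lists across tools. A minimal base R sketch, using invented comments rather than the survey data, shows how the two counts diverge when a respondent repeats a term:

```r
# Invented comments, one per respondent.
comments <- c("great great program", "great fair", "fair fair fair")
token_lists <- strsplit(comments, " ", fixed = TRUE)

# Occurrences: every use of a term counts.
occurrences <- table(unlist(token_lists))

# Respondents: each term counts at most once per comment.
respondents <- table(unlist(lapply(token_lists, unique)))

print(occurrences)
print(respondents)
```

Here "fair" occurs four times but is mentioned by only two respondents, so the two tools could rank it differently on the same data.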

Synonyms for “Excellent”
stars, 10/10, 100 % correct, 100% accurate, 100% correct, 100% grade a, 5 star, 5 stars, 5-star, ^ best $, ^ great $, a must, a nice plus, a plus, a+, a++, aagood, above and beyond, above excellence, absolute life saver, absolute word class, acceptional, admirable, all was well, allright, alright, always a please, amazing, among the best, among the very best, appreciable, appreciative, award winning, awesome, awesopme, awsome, beenfantastic, best asset, best of all, best possible, beyond expectation, beyond expectations, big asset, big beast, big hit, big hits, big kudos to, big plus, blow all others away, blows all others away, blows the doors off, brilliant $, can not be beat, can't be beat, can't beat, cannot be beat, capable, capible, class service, compliment, compliment one another well, congrats, congratulations, copious, cutting edge, cutting-edge, dandy, delight, deluxe, deserves a raise, deserves credit, does that well, doing her best, doing his best, doing their best, done very well, dynamite, exccellent, excelent, excellant, excellence, excellet, excelllent, excepional, exceptional, exceptionl, execellent, exelant, exelent, exellant, exellecent, exellent, exlt, expectional, exquis, exquise, exquises, exquisite, exquisitely $, extraordinary, extrodinary, fabulous, fairly well, fanatstic, fantabulous, fantasic, fantastic, fantatic, finest, first class, first-class, first-rate, five stars, formidable, frantastic, given me the most, godsend, goes over well, goodd, gooood, graet, grat, grea, greaat, great pleasure, greate, greatest, greeeeeeeaaattttt, gret, greta, hats down, hats off, head and shoulders better, heavenly, high hats off, ideal, impecable, impeccable, impress, impresses me most, impressive, in an orderly fashion, incomparable, incredibe, incredible, increible, indisputable, ingenious, inpecable, invaluable, is still the best, it was a pleasure, knock socks off, knock spots off, kudos, kudos to, laudable, lifesaver, made an impression, made the difference, magnificent, marvellous, marvelous, my compliments to, nicest, number 1, number one, oustanding, out of the woods, out of the world, out of this world, outperform, outperforming, outsanding, outstanding, peachy, perfect, perfection, perfectly done, phenomenal, phenominal, pleasure of working with, prettier, pretty good, quintessential, reach a ten, real good, real nice, remarkable, right direction, rock $, rocked my world, second to none, sensational, smashing, spectacular, spendid, splendid, stand head & shoulders above, stand head and shoulders above, standing head & shoulders above, standing head and shoulders above, stands head & shoulders above, stands head and shoulders above, stood head & shoulders above, stood head and shoulders above, strong positive, superb, supurb, surpassed my expectations, surreal, sweetheart, ten stars, terric, terrific, terrifig, the best, the best one so far, the best thing, the highlight of, the only one that works, thebest, think highly, think very highly, to die for, top notch, top quality, top ranked, top-flight, top-notch, top-of-the-line, top-ranked, top-ranking, topflight, topnotch, topranked, topranking, tremedous, tremendous, tried and proven, trmendous, turn out good, two thumbs up, unbeatable, unmatched, unmnatched, unparalleled, unquestionable, unquestionnable, unsurpassed, up 2 standard, up 2 standards, up 2 usual standards, up to standard, up to standards, up to usual standards, up to your usual standards, up-beat, upbeat, utmost, v-good, well done, went above and beyond my expectations, woderful, womderful, wondeful, wonderful, wonderfull, wonedeful, wonederful, would be the smartest, wounderful
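One way a synonym list like this could be applied outside SPSS is as a lookup that maps variant spellings to a single concept. The base R sketch below uses a handful of single-word entries taken from the list above; the function name and the example tokens are assumptions, and multi-word entries such as "top notch" would need phrase matching that is not shown here.

```r
# A few single-word entries from the "Excellent" list, all mapped to one concept.
excellent_terms <- c("excellent", "excelent", "exellent",
                     "awesome", "awsome", "superb", "supurb")
concept_lookup <- setNames(rep("excellent", length(excellent_terms)),
                           excellent_terms)

# Replace any token found in the lookup with its concept; leave others unchanged.
map_to_concept <- function(tokens, lookup) {
  mapped <- lookup[tokens]  # NA where the token is not in the dictionary
  unname(ifelse(is.na(mapped), tokens, mapped))
}

# Invented tokens from a hypothetical extraction.
tokens <- c("exellent", "program", "awsome", "fair")
mapped_tokens <- map_to_concept(tokens, concept_lookup)
print(mapped_tokens)
```

Keeping the lookup as a named vector makes it easy to save and re-use in later survey waves.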

Adding Wordnet to our R (/RapidMiner) analysis
    library("wordnet")
    setDict("C:/Wordnet/WordNet-3.0/dict")
    synonyms("excellent", "ADJECTIVE")
    [1] "excellent" "fantabulous" "first-class" "splendid"

Analytics to aid refinement

Job … Fair
Students are asking for more “stuff” at the job fair

R Extraction Results – Top 20 Terms

Onward to analysis
*This is an anonymised example

Onward to analysis (3. Analyse)
R
  – In R we are in a statistical platform already
  – Text Analytics outputs are part of the data in the current “Workspace”
  – For Research style charts and tables we may need to export data
Rapid Miner
  – In RM we are in a Data Mining platform already
  – Text Analytics is part of the current process flow
SPSS Text Analytics for Surveys
  – Data needs to be exported elsewhere for Analysis
  – To SPSS .sav, Excel or Data Collection

A High Level Comparison
Help & Support
  – R: Lots of User Generated Content
  – Rapid Miner: Lots of User Generated Content; Paid support option
  – SPSS TAfS: Paid support
Usability
  – R: Low level coding control
  – Rapid Miner: Visual programming
  – SPSS TAfS: Visual UI
Scalability
  – R: R in itself isn’t too scalable but many scalable implementations exist, e.g. Revolution, Hadoop
  – Rapid Miner: Radoop
  – SPSS TAfS: We experienced issues with data sets around 100,000 cases*
Extensibility
  – R: Various options
  – SPSS TAfS: None
Automation
  – R: Can be run in batch
  – Rapid Miner: Can be run in batch
  – SPSS TAfS: None
Overall
  – R: Great for the coder and those familiar with R
  – Rapid Miner: The power of R with a GUI
  – SPSS TAfS: The most graphical and tuned for generic survey types, e.g. Opinions
*IBM/SPSS have a Text Analytics option for Data Mining which may be more scalable; we haven’t tested it yet

Our current conclusions
Dictionaries help in the initial extraction
  – But it is almost inevitable you will want to extend them to get to the specificity of the study
If the study domain is very specific you can build your own dictionaries in all 3 tools
A lot of social media monitoring starts with libraries of regular expressions built from the ground up
Open Source tools like R and Rapid Miner will continue to improve with “packages” added by the R community
There is no “silver bullet”. The Refine step will typically require a lot of manual input
  – Especially in the initial “build” phase
  – More is required on larger surveys
But the ROI, in time and/or cost, should be clear
  – And the results more robust and reliable
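The conclusions mention libraries of regular expressions built from the ground up. A very small library of that kind might look like the base R sketch below. The concept names, patterns and comments are all invented for illustration: "job fair" echoes the earlier example, while "travel grant" is purely hypothetical.

```r
# A tiny, hypothetical "library" of concept patterns.
concept_patterns <- c(
  job_fair     = "\\bjob\\s+fairs?\\b",
  travel_grant = "\\btravel\\s+grants?\\b"
)

# Invented comments, one per respondent.
comments <- c("More employers at the Job Fair please",
              "Travel grants were helpful",
              "Nothing to add")

# One logical column per concept: TRUE where the comment matches the pattern.
flags <- sapply(concept_patterns, function(pattern) {
  grepl(pattern, comments, ignore.case = TRUE, perl = TRUE)
})
print(flags)
```

Each column of the result is a structured 0/1 style variable that can be carried into the Analyse step alongside the closed survey questions.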

A journey into Text Analytics
Thank you & Questions
John McConnell, Analytical People