Provalis Research Corp.

Presentation transcript:

Provalis Research Corp. Normand Péladeau, President, Provalis Research Corp. peladeau@provalisresearch.com

Provalis Research Some commercial clients:

Provalis Research Some governmental and NGO clients:

Provalis Research Some Academic Clients:

Our Products Statistical Analysis & Bootstrapping 1989

Our Products

Our Products: Statistical Analysis & Bootstrapping (1989); Content Analysis & Text Mining (1998); Qualitative Analysis & Mixed Methods (2004)

WordStat for Stata (June 2013, Stata 13): Content Analysis & Text Mining

Necessity is the mother of invention: issues, challenges, innovation, invention

Laziness / Necessity is the mother of invention: issues, challenges, innovation, invention

Text Analytics Challenge THREE MAJOR OBSTACLES 1) Very large number of word forms

Challenge #1 – Quantity. 38,996 comments about hotels: 2.1 million words (tokens), 20,116 word forms (types). 1.8 million course evaluations: 35 million words (tokens), 78,159 word forms (types)
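The token and type counts on this slide can be illustrated with a short sketch. The tokenizer below is a deliberately crude assumption (lowercase runs of letters and apostrophes), and the two sample comments are invented; WordStat's own tokenization rules are not described in the slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters/apostrophes (crude, illustrative)."""
    return re.findall(r"[a-z']+", text.lower())

def token_type_counts(documents):
    """Return (number of tokens, number of types, Counter of word-form frequencies)."""
    freq = Counter()
    for doc in documents:
        freq.update(tokenize(doc))
    return sum(freq.values()), len(freq), freq

# Invented sample comments, only to show the difference between tokens and types.
comments = [
    "The room was clean and the staff was friendly.",
    "Clean room, friendly staff, great location!",
]
n_tokens, n_types, freq = token_type_counts(comments)
print(n_tokens, n_types)
```

Tokens are running words; types are distinct word forms, which is why the second number grows much more slowly than the first as a corpus gets larger.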

Text Analytics Challenge

Text Analytics Challenge: Positive and Negative Reactions. [Bar chart: count of POSITIVES and NEGATIVES by category (ARCHITECTURE, MESSAGING, CRS, INTERFACES, MARKET, NET_NOUNS, SIZEANDPOWER, ACQUISITION, SOFTWARE, COMPETITORS, PERFORMANCE, LAUNCH, HARDWARE, CISCO, CISCOPRODUCTS, UPGRADE, COST); axis: Count]

Challenge #1 – Solutions. Statistical tools: frequency selection; data reduction techniques (HCA, PCA, FA); exploratory data analysis (e.g., CA); machine learning

The Statistics of Text Distribution of words: Zipf distribution

The Statistics of Text: 38,988 comments about hotels, 2.1 M words (20,114 different words)

Most frequent forms   Percentage of forms   Percentage of words
49                    0.24%                 50%
300                   1.5%                  76%
500                   2.5%                  83%
1000                  5.0%                  90%
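The kind of coverage figures shown in this table can be computed from any frequency list. The sketch below assumes a toy vocabulary whose counts follow an idealized Zipf law (count proportional to 1/rank); the names `coverage_of_top_forms` and `toy` are illustrative, and the numbers it prints are not the hotel data.

```python
from collections import Counter

def coverage_of_top_forms(freq, n):
    """Share of all tokens accounted for by the n most frequent word forms."""
    total = sum(freq.values())
    top = sum(count for _, count in freq.most_common(n))
    return top / total

# Toy frequencies following an idealized Zipf law: count proportional to 1/rank.
toy = Counter({f"w{rank}": 1000 // rank for rank in range(1, 201)})

for n in (5, 20, 50):
    pct_forms = 100 * n / len(toy)
    pct_words = 100 * coverage_of_top_forms(toy, n)
    print(f"top {n:>3} forms = {pct_forms:.1f}% of forms, {pct_words:.1f}% of words")
```

Even in this small toy list, a small fraction of the forms accounts for most of the running words, which is the property that makes frequency selection an effective first reduction step.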

Challenge #1 – Solutions. Statistical tools: frequency selection; data reduction techniques (HCA, PCA, FA); exploratory data analysis (e.g., CA); machine learning. Natural language processing (NLP) tools: stopword list; stemming; lemmatization
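As a rough illustration of how a stopword list and stemming shrink the number of word forms, here is a minimal sketch. The stopword set is a tiny invented sample and `naive_stem` is plain suffix stripping chosen for brevity; it is not the Porter algorithm or whatever stemmer a given tool actually uses.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "of", "to", "in"}
# Naive suffix stripping, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def reduce_forms(text):
    """Tokenize, drop stopwords, and count the stemmed forms that remain."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOPWORDS)

counts = reduce_forms("The rooms were cleaned and the room was clean.")
print(counts)
```

Nine running words collapse to two stems here. A naive stripper like this one will mangle many real words, which is one reason lemmatization is listed as a separate option.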

Text Mining Approach, PROS and CONS: very fast; very little effort; inductive; comparison of results; does not quantify accurately; insensitive to low-frequency events

Text Analytics Challenge THREE MAJOR OBSTACLES 1) Very large number of word forms

Text Analytics Challenge THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language: one idea → multiple forms

Content analysis method


Content analysis method. PROS: can measure more accurately; can be focused (multi-focus); allows full automation. CONS: dictionary construction & validation
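The dictionary-based approach summarized above can be sketched as a mapping from categories to word patterns. Everything in this example is hypothetical: the categories, the patterns, and the `code_text` helper are invented for illustration, and `*` is treated as a simple shell-style wildcard via `fnmatchcase`.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical categorization dictionary: category -> word patterns ("*" is a wildcard).
DICTIONARY = {
    "CLEANLINESS": ["clean*", "dirty", "spotless", "filth*"],
    "STAFF": ["staff", "receptionist*", "employee*", "concierge"],
}

def code_text(text, dictionary=DICTIONARY):
    """Count, per category, the tokens that match any pattern of that category."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for category, patterns in dictionary.items():
            if any(fnmatchcase(token, p) for p in patterns):
                counts[category] += 1
    return counts

result = code_text("Spotless room, very clean. The receptionists and staff were lovely.")
print(result)
```

Several different word forms map to one idea, which addresses the polymorphy obstacle; the cost, as the slide notes, is that someone has to build and validate the dictionary.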

Text Analytics Challenge THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language: one idea → multiple forms

Text Analytics Challenge THREE MAJOR OBSTACLES 1) Very large number of word forms 2) Polymorphy of language: one idea → multiple forms 3) Polysemy of words: one word → many ideas

Challenge #3 – Polysemy of words. Keyword in Context List (KWIC). Senses of the word “stress”: #1 (psychology) a state of mental or emotional strain or suspense; #2 (physics) force that produces strain on a physical body; #3 (verb) single out as important
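A keyword-in-context list is simple to produce, which makes it a convenient way to inspect the senses of a word such as “stress”. The sketch below is an assumption about how one might build it, with an invented sample text; the context `width` is counted in words.

```python
import re

def kwic(text, keyword_pattern, width=3):
    """Return (left context, keyword, right context) for each matching token."""
    tokens = re.findall(r"[A-Za-z']+", text)
    regex = re.compile(keyword_pattern, re.IGNORECASE)
    rows = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            rows.append((left, token, right))
    return rows

# Invented sample text containing three different senses of "stress".
sample = ("Employees under stress make more errors. "
          "I must stress that the beam failed under mechanical stress.")

for left, keyword, right in kwic(sample, r"stress\w*"):
    print(f"{left:>28} | {keyword} | {right}")
```

Reading the aligned contexts side by side is usually enough to see which neighboring words separate one sense from another.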

Challenge #3 – Polysemy of words. Keyword in Context List (KWIC): disambiguation using phrases. STRESS*_THE or STRESS*_THAT → “single out as important”; UNDER_STRESS → emotional state
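One possible way to express these phrase patterns is as regular expressions. The translation from the underscore notation to regex, the sense labels `EMPHASIZE` and `EMOTIONAL_STATE`, and the helper name are all assumptions made for this sketch.

```python
import re

# Hypothetical phrase rules: the slide's underscore notation rewritten as regexes.
PHRASE_RULES = [
    (re.compile(r"\bunder\s+stress\b", re.IGNORECASE), "EMOTIONAL_STATE"),
    (re.compile(r"\bstress\w*\s+(?:the|that)\b", re.IGNORECASE), "EMPHASIZE"),
]

def tag_stress_senses(text):
    """Return (matched phrase, sense label) pairs in order of appearance."""
    hits = []
    for regex, label in PHRASE_RULES:
        for match in regex.finditer(text):
            hits.append((match.start(), match.group(0).lower(), label))
    return [(phrase, label) for _, phrase, label in sorted(hits)]

tagged = tag_stress_senses("She stressed that staff were under stress.")
print(tagged)
```

Occurrences of “stress” that match no phrase stay untagged, so they can still be reviewed in a KWIC list rather than being forced into a sense.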

Challenge #3 – Polysemy of words. Keyword in Context List (KWIC): disambiguation using rules. TRANSFER* IS NEAR TECHNOLOGY; TRANSFER* IS NOT NEAR BUS; SATISFIED IS AFTER #NEGATION
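Proximity rules of this kind can be approximated with token windows. In the sketch below the window sizes, the `NEGATIONS` set standing in for `#NEGATION`, and the function names are assumptions; the slides do not say how NEAR or AFTER are defined.

```python
import re
from fnmatch import fnmatchcase

# Stand-in for the slide's #NEGATION word list (an assumption).
NEGATIONS = {"not", "no", "never"}

def tokens_of(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def is_near(tokens, i, target, window=3):
    """True if a token matching `target` lies within `window` words of position i."""
    lo, hi = max(0, i - window), i + window + 1
    return any(fnmatchcase(t, target)
               for j, t in enumerate(tokens[lo:hi], lo) if j != i)

def is_after_negation(tokens, i, window=3):
    """True if a negation word occurs in the `window` words before position i."""
    return any(t in NEGATIONS for t in tokens[max(0, i - window): i])

toks = tokens_of("The technology transfer office is far from the bus transfer station.")
for i, t in enumerate(toks):
    if fnmatchcase(t, "transfer*"):
        print(i, t, is_near(toks, i, "technology"), is_near(toks, i, "bus"))
```

In this invented sentence the first “transfer” satisfies the NEAR TECHNOLOGY rule and the second fails the NOT NEAR BUS rule, so only the first would be coded as technology transfer.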

Challenge #4 – Misspellings. Of 78,159 word forms, 46,404 are “unknown” words: 75% misspellings (≈ 35,000), 21% proper names (products & people), 4% acronyms

Challenge #4 – Misspellings: 61 ways to be “Enthusiastic”

Challenge #4 – Misspellings. Fuzzy and phonetic string comparison algorithms: Damerau-Levenshtein, Koelner Phonetik, SoundEx, Metaphone, Double-Metaphone, NGram, Dice, Jaro-Winkler, Needleman-Wunsch, Smith-Waterman-Gotoh, Monge-Elkan
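To give a feel for the first algorithm in this list, here is a minimal sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment), followed by a hypothetical `suggest` helper that ranks vocabulary words by that distance. The threshold of 2 and the small vocabulary are assumptions, and this is not a description of how WordStat implements its spelling correction.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Insertions, deletions, substitutions and transpositions of two
    adjacent characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance of word, closest first."""
    scored = sorted((osa_distance(word, v), v) for v in vocabulary)
    return [v for dist, v in scored if dist <= max_distance]

vocabulary = ["enthusiastic", "energetic", "ecstatic"]
for misspelling in ("enthusaistic", "enthousiastic", "enthusiastc"):
    print(misspelling, "->", suggest(misspelling, vocabulary))
```

Edit-distance measures catch typing slips such as swapped or dropped letters; the phonetic algorithms in the list target a different class of errors, where the word is spelled the way it sounds.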

Challenge #4 – Misspellings

What’s next? Questions?