1
Advanced Statistical Analysis: Text analysis with Stata txttool, R R.temis and Hyperbase (15FEB2019)
Instructor: Prof. Louis Chauvel
2
This session
  General references
  What's the matter? Text(ual) analysis, lexicometry, text mining, …
  Stata tools
  R tools
  Hyperbase
3
References
  Stata advanced manual
  Set of references "as usual":
  Plus more recent …
4
Main references
Find them online on http://www.a-z.lu/
Almost none recently, apart from:
  Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications
  Authors: Gary Miner, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen
  Too much, too heavy, too general, but it is the reference …
5
SEE ALSO Find this online at :
6
What is to be done?
  Material: open, long answers to a question / issue / interview
  Typical case: interviews lasting from several minutes to 1 hour or more
  In general you personally know your sample
  And you have some additional indicators on who the respondents are
  Aim: describe the contents of speech / content / matter / style … to understand major cleavages through what people say
  A quantitative extension of qualitative research
7
Typical processing: Data management
  Clean (lower case, punctuation, quotes, ???)
  Format the data (different in each software)
  Import the data into the software (many issues)
  "Stopwords" and lemmatization (suppress grammatical inflection)
  Stemming (see the Porter stemmer)
Data processing
  Dictionary and sub-counts of words: what they speak of, and who
  Concordance / correspondence of words: coherence of words / people
  Factor and cluster analyses: contrasts and grouping of words / people
  … interpretations …
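The cleaning, stopword and sub-count steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the function names and the toy records are invented for the example, and no stemming or lemmatization is attempted.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "we", "our"}

def clean(text):
    """Lower-case the text, replace anything that is not an ASCII letter
    or whitespace with a space, then split on whitespace.
    Note: this crude rule also drops accented letters and digits."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

def tokens(text):
    """Cleaned tokens with stopwords removed."""
    return [w for w in clean(text) if w not in STOPWORDS]

def counts_by_group(records):
    """Sub-counts of words per group: who speaks of what."""
    out = {}
    for group, text in records:
        out.setdefault(group, Counter()).update(tokens(text))
    return out

# Invented toy records: (party, transcript)
records = [
    ("D", "The Court must protect the rights of workers."),
    ("R", "The judge applies the law; the law is the law."),
    ("D", "Workers' rights, and our rights, matter."),
]
by_party = counts_by_group(records)
```

Here `by_party["D"].most_common(2)` lists the two most frequent non-stopwords in the "D" group. A real analysis would add stemming or lemmatization before counting.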
8
Typical processing: Data processing
Main issue: proximity of words and of texts
  We need a matrix of relations (proximity)
  Many solutions (differences, ratios)
A relatively common solution:
  Consider the frequency table with rows = words and columns = persons
  Add 0.5 to the frequency in each cell (to handle zeros)
  Take the log of the frequency
  Compute the mean of log(fr) per row
  Keep the difference between log(fr) and the row mean of log(fr)
  Positive and negative values are a good indicator of attraction / repulsion between word and person
Then: factor and cluster analyses
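The transformation described above (add 0.5, take logs, subtract the row mean) can be sketched as follows. The function name, the dictionary layout and the toy counts are assumptions made for the illustration, not part of the course material.

```python
import math

def log_centered(table):
    """table: {word: {person: count}}.
    Returns the same layout, where each cell holds
    log(count + 0.5) minus the mean of those logs over the word's row."""
    persons = sorted({p for row in table.values() for p in row})
    result = {}
    for word, row in table.items():
        # Missing cells count as zero; the +0.5 keeps the log defined.
        logs = {p: math.log(row.get(p, 0) + 0.5) for p in persons}
        mean = sum(logs.values()) / len(persons)
        result[word] = {p: logs[p] - mean for p in persons}
    return result

# Invented toy table: rows = words, columns = persons A, B, C
table = {
    "law":    {"A": 6, "B": 0, "C": 1},
    "rights": {"A": 0, "B": 5, "C": 2},
}
scores = log_centered(table)
```

Each row of `scores` sums to zero by construction. A positive cell means the person uses the word more than the row's average on the log scale (attraction), a negative cell less (repulsion). The resulting matrix is what would then be passed to a factor or cluster analysis.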
9
Software
  Qualitative research oriented
    NVivo
    ATLAS.ti
    Etc.
  Quantitative method oriented
    Stata and R commands
    Hyperbase (FR)
    WordStat (for Stata)
    TextAnalyst
10
Typical case: Neil Gorsuch
Opening declarations of 22 U.S. Senators (R and D) in the U.S. Associate Justice hearings
11
Typical case: Neil Gorsuch
  Opening declarations of 22 U.S. Senators in the U.S. Associate Justice hearings
  22 extensive transcripts (10 minutes)
  We know their names
  GDPR issues? No, it is public…
  In the dataset: name + d/r (political party) and transcript
  You love U.S. politics? See texts here
12
Raw Material
13
PART 1 Stata and text analysis
  You need Stata 13 at minimum …
  Long string variables have almost no length limitation
  Copy-paste is a simple way to import data
  So…
  Important Stata SSC module: ssc install txttool
  It provides a Porter stemming option (stem) and counts of words (bag)
  The rest is the usual multidimensional descriptive analysis (factor and cluster)
  Example: Stata syntax
  WE PROCEED NOW!
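To make the idea of a "bag of words" concrete, here is a minimal Python sketch of the kind of table such a step produces: one row per document and one count column per word. It is an analogy only, not txttool's implementation; the function name, the `w_` column prefix and the toy documents are choices made for this example.

```python
from collections import Counter

def bag_of_words(docs, prefix="w_"):
    """docs: list of already-cleaned strings (words separated by spaces).
    Returns one dict per document, with one count column per vocabulary word."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        # Counter returns 0 for words absent from this document.
        rows.append({prefix + w: counts[w] for w in vocab})
    return rows

# Invented toy documents, assumed already lower-cased and stripped of stopwords
docs = ["law judge law", "rights workers rights law"]
rows = bag_of_words(docs)
```

Every row has the same columns, so the result can be read as a documents-by-words frequency matrix, ready for the factor and cluster analyses mentioned above.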
14
PART 2 R and text analysis
  R.temis, a new (V2) R package (R TExt MIning Solution)
  First install the latest version of RStudio (with the latest version of R)
  Install the package R.temis
  Additional formatting requirements
  Example: R script
  WE PROCEED NOW!
15
PART 3 Hyperbase and text analysis
  Available for free here
  Pros: free, robust, appropriate for multilingual contexts
  But: old, French, and at some point you have to go back to Part 1 = Stata