Instructor: Prof. Louis Chauvel


1 Instructor: Prof. Louis Chauvel
Advanced Statistical Analysis: Text analysis with Stata (txttool), R (R.temis) and Hyperbase (15 Feb 2019)
Instructor: Prof. Louis Chauvel

2 This session
General references
What’s the matter? Text(ual) analysis, lexicometry, text mining, …
STATA tools
R tools
Hyperbase

3 References: STATA ADVANCED MANUAL: Set of references « As usual »:
Plus more recent …

4 Main references Find them online on http://www.a-z.lu/
Almost none recently, apart from:
Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications
Authors: Gary Miner, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen
Too much, too heavy, too general, but it is the reference …

5 SEE ALSO Find this online at :

6 What is to be done?
Open, long answers to a question / issue / interview
Typical case: interviews lasting from several minutes to 1 hour or more
In general you personally know your sample
And you have some additional indicators on who they are
Description of the contents of speech / content / matter / style …
… to understand major cleavages through what people say
A quantitative extension of qualitative research

7 Typical processing: Data management
Data management
   Clean (lower case, punctuation, quotes, ???)
   Format the data (different in each software)
   Import the data into the software (many issues)
   “Stopwords” and lemmatization (suppress grammatical inflection)
   Stemming (see Porter Stemmer)
Data processing
   Dictionary and sub-counts of words -> what they speak of and who
   Concordance / correspondence of words -> coherence of words / people
   Factor & cluster analyses -> contrasts and grouping of words / people
… interpretations …
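The cleaning, stopword and word-count steps above can be sketched in a few lines. This is a minimal illustration in Python, not the Stata or R workflow of the course. The stopword list and the function names (clean, bag_of_words, word_by_speaker_table) are hypothetical choices made for this example, and stemming is left out.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "we", "this"}


def clean(text):
    """Lower-case the text and replace punctuation and curly quotes with spaces."""
    table = str.maketrans({ch: " " for ch in string.punctuation + "“”‘’«»"})
    return text.lower().translate(table)


def bag_of_words(text, stopwords=STOPWORDS):
    """Count the words of one cleaned text, skipping stopwords."""
    tokens = clean(text).split()
    return Counter(t for t in tokens if t not in stopwords)


def word_by_speaker_table(transcripts):
    """Build a words x speakers frequency table as nested dictionaries.

    transcripts maps a speaker name to that speaker's raw text.
    The result maps each word to {speaker: count}.
    """
    bags = {name: bag_of_words(text) for name, text in transcripts.items()}
    vocabulary = sorted(set().union(*bags.values()))
    return {w: {name: bags[name][w] for name in bags} for w in vocabulary}


if __name__ == "__main__":
    # Invented two-speaker example, not taken from the course dataset.
    demo = {
        "speaker_1": "The Court, the law and the Constitution.",
        "speaker_2": "Law is law!",
    }
    for word, counts in word_by_speaker_table(demo).items():
        print(word, counts)
```

Because apostrophes count as punctuation here, a contraction such as "don't" is split into two tokens; a real project would decide this case explicitly.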

8 Typical processing: Data processing
Main issue -> proximity of words and of texts
-> we need a matrix of relations (proximity)
-> many solutions (differences, ratios)
-> a relatively common solution:
   consider the frequency table with rows = words and columns = persons
   add 0.5 to the frequency in each cell (to handle zeros)
   take the log of the frequency
   compute the mean of log(fr) per row
   keep the difference between log(fr) and the mean of log(fr) per row
   + and - values are a good indicator of attraction/repulsion between word and person
Factor & cluster analyses
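The row-centred log-frequency transformation listed above can be written out as follows. This is a minimal sketch in Python, assuming the table is stored as a list of rows (one row per word, one column per person). The function name log_ratio_table and the offset parameter are names introduced for this example.

```python
import math


def log_ratio_table(freq, offset=0.5):
    """Return log(count + offset) minus the row mean of those logs.

    freq is a list of rows (one per word); each row is a list of counts
    (one per person). Positive values suggest attraction between a word
    and a person, negative values repulsion.
    """
    result = []
    for row in freq:
        logs = [math.log(count + offset) for count in row]
        mean_log = sum(logs) / len(logs)
        result.append([value - mean_log for value in logs])
    return result


if __name__ == "__main__":
    # Invented counts: two words (rows) by two persons (columns).
    table = [[4, 0], [2, 2]]
    for row in log_ratio_table(table):
        print([round(v, 3) for v in row])
```

Each row of the result sums to zero, so a word used equally by everyone gets values of 0 throughout. The resulting matrix is the kind of input the factor and cluster analyses would then work on.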

9 Software
Qualitative research oriented
   NVivo
   ATLAS.ti
   Etc.
Quantitative method oriented
   STATA and R commands
   Hyperbase (FR)
   WordStat (for STATA)
   TextAnalyst

10 Typical case: Neil Gorsuch
Opening declarations by 22 U.S. Senators (Republicans and Democrats) in the U.S. Associate Justice hearings

11 Typical case: Neil Gorsuch
Opening declarations by 22 U.S. Senators in the U.S. Associate Justice hearings
22 extensive transcripts (10 minutes)
We know their names
GDPR issues? No, the material is public…
In the dataset: name, d/r (political party) and transcript
You love U.S. politics? See texts here

12 Raw Material

13 PART 1 STATA and text analysis
You need Stata 13 at minimum …
Long string text has almost no length limitation
Copy-paste is a simple way to import data
So…
Important STATA module to install from SSC: ssc install txttool
   Provides a Porter stemming option (stem) and counts of words (bag)
The rest is the usual multidimensional descriptive analysis (factor and cluster)
Example: STATA syntax
WE PROCEED NOW!

14 PART 2 R and text analysis
R.temis, a new (V2) R package (TExt MIning Solution)
First install the latest version of RStudio (with the latest version of R)
Install the package R.temis
Additional formatting requirements
Example: R script
WE PROCEED NOW!

15 PART 3 HYPERBASE and text analysis
Available for free here
The +: free, robust, appropriate for multilingual contexts
But: old, French, and at some point you have to go back to Part 1 (STATA)

16 Main references Find them online on http://www.a-z.lu/

