Instructor: Prof. Louis Chauvel
Advanced Statistical Analysis: Text analysis with Stata txttool, R R.temis and Hyperbase (15FEB2019)
This session
  General references
  What’s the matter? Text(ual) analysis, lexicometry, text mining, …
  Stata tools
  R tools
  Hyperbase
References
  Stata advanced manual, the set of references « as usual »: http://www.louischauvel.org/stata_manuel_advanced.pdf
  Plus more recent …
Main references
  Find them online at http://www.a-z.lu/
  Almost none recently, apart from:
  Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications
  Authors: Gary Miner, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen
  Too much, too heavy, too general, but it is the reference …
See also
  Find this online at: https://mhealth.jmir.org/2018/4/e101/
What is to be done?
  Open, long answers to a question / issue / interview
  Typical case: 30-100 interviews, each lasting from several minutes to 1 hour or more
  In general you personally know your sample
  And you have some additional indicators on who they are
  Description of the contents of speech (content / matter / style …) to understand major cleavages through what people say
  A quantitative extension of qualitative research
Typical processing
  Data management
    Clean (lower case, punctuation, quotes, ???)
    Format the data (different in each software)
    Import the data into the software (many issues)
    “Stopwords” and lemmatization (remove grammatical inflection)
    Stemming (see the Porter Stemmer)
  Data processing
    Dictionary and sub-counts of words: what they speak of, and who
    Concordance / correspondence of words: coherence of words / people
    Factor & cluster analyses: contrasts and grouping of words / people
  … interpretations …
  Porter Stemmer: https://tartarus.org/martin/PorterStemmer/
  Stemming and lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
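The data-management steps above (cleaning, stopwords, stemming, word counts) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, `crude_stem` is a deliberately naive suffix stripper and not the Porter algorithm, and the two transcripts and speaker labels are invented examples, not taken from the course dataset.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
             "that", "this", "we", "it"}

def crude_stem(word):
    """Strip a few common English suffixes. NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, replace non-letters with spaces, drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOPWORDS]

def bag_of_words(text):
    """Count stemmed tokens: one 'bag' per speaker."""
    return Counter(crude_stem(w) for w in tokenize(text))

# Hypothetical mini-transcripts keyed by hypothetical speaker labels.
transcripts = {
    "speaker_d": "The judges protect the rights of workers, and judging is hard.",
    "speaker_r": "This judge respected the law; we respect judges.",
}
bags = {name: bag_of_words(text) for name, text in transcripts.items()}

for name, bag in bags.items():
    print(name, dict(bag))
```

Note that the naive stemmer maps "judges" and "judging" to the same stem but leaves "judge" untouched, which is one reason to prefer a real stemmer or a lemmatizer in practice.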
Typical processing: Data processing
  Main issue: proximity of words and of texts
  We need a matrix of relations (proximity)
  Many solutions exist (differences, ratios)
  A relatively common solution:
    1. Consider the table of frequencies with rows = words and columns = persons
    2. Add 0.5 to the frequency in each cell (to handle zeros)
    3. Take the log of the frequency
    4. Compute the mean of log(fr) per row
    5. Keep the difference between log(fr) and the row mean of log(fr)
  Positive and negative values are a good indicator of attraction / repulsion between word and person
  Factor & cluster analyses
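The row-centered log-frequency computation described above can be written out as a short Python sketch. It assumes that 0.5 is added to every cell (one reading of the slide; adding it only to zero cells would be another), and the word counts and senator labels are invented for illustration.

```python
import math

def centered_log_frequencies(table):
    """table maps each word to a list of counts, one per person (column).

    Returns, for each word, log(count + 0.5) minus the row mean of
    those logs. Positive = the person over-uses the word (attraction),
    negative = under-uses it (repulsion).
    """
    result = {}
    for word, counts in table.items():
        logs = [math.log(c + 0.5) for c in counts]  # 0.5 avoids log(0)
        row_mean = sum(logs) / len(logs)
        result[word] = [x - row_mean for x in logs]
    return result

# Hypothetical word-by-person frequency table.
persons = ["senator_a", "senator_b", "senator_c"]
table = {
    "law":    [4, 0, 2],
    "family": [1, 1, 1],
}
scores = centered_log_frequencies(table)

for word, row in scores.items():
    print(word, [round(v, 3) for v in row])
```

Each row of the result sums to zero by construction, so a word used equally by everyone (like "family" here) gets a score of zero in every column. The resulting matrix is the kind of input that the factor and cluster analyses then work on.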
Software
  Qualitative research oriented
    NVivo
    ATLAS.ti
    Etc.
  Quantitative method oriented
    Stata and R commands
    Hyperbase (FR)
    WordStat (for Stata)
    TextAnalyst
Neil Gorsuch: typical case
  22 U.S. Senators (Republicans and Democrats)
  Opening statements in the U.S. Associate Justice hearings
  https://www.youtube.com/watch?v=RlJEXiZONrQ
  https://www.congress.gov/115/chrg/shrg28638/CHRG-115shrg28638.htm
Neil Gorsuch: typical case
  22 U.S. Senators
  Opening statements in the U.S. Associate Justice hearings
  22 extensive transcripts (10 minutes)
  We know their names
  GDPR issues? No, it is public … (https://eugdpr.org/)
  In the dataset: name, d/r (political party) and transcript
  You love U.S. politics?
  https://en.wikipedia.org/wiki/Neil_Gorsuch_Supreme_Court_nomination#Committee
  https://en.wikipedia.org/wiki/United_States_Senate_Committee_on_the_Judiciary#Members,_115th_Congress
  See texts here: http://www.louischauvel.org/Gorsuch.doc
  https://www.youtube.com/watch?v=RlJEXiZONrQ
Raw Material
PART 1: Stata and text analysis
  You need Stata 13 at minimum …
  Long string text has almost no length limitation
  Copy-paste is a simple way to import data
  So…
  Important Stata SSC module: ssc install txttool
  Provides a Porter stemming option (stem) and counts of words (bag)
  The rest is the usual multidimensional descriptive analysis (factor and cluster)
  Example: Stata syntax http://www.louischauvel.org/gorsuch.do
  WE PROCEED NOW!
PART 2: R and text analysis
  R.temis, a new (V2) R package (TExt MIning Solution)
  https://cran.r-project.org/web/packages/R.temis/R.temis.pdf
  First install the latest version of RStudio (with the latest version of R)
  Install the package R.temis
  Additional formatting requirements
  Example: R script http://www.louischauvel.org/gorsuch.R
  http://www.louischauvel.org/Rtemis_FR.docx
  https://rtemis.hypotheses.org/
  https://cran.r-project.org/web/packages/R.temis/index.html
  WE PROCEED NOW!
PART 3: Hyperbase and text analysis
  Available for free here: http://ancilla.unice.fr/
  The +: free, robust, appropriate for multilingual contexts
  But: old, in French, and at some point you have to go back to Part 1 (Stata)
Main references
  Find them online at http://www.a-z.lu/
  https://www.stata-journal.com/sjpdf.html?articlenum=dm0077
  http://ancilla.unice.fr/bases/manuel.pdf