Text Mining : Twitter
Agenda
Making connection
Fetching tweets/data
Mining of tweets
Making connection
The Twitter API has required authentication since March 2013. Twitter has closed access to version 1.0 of the API; the current version (to date) is 1.1.
You have to create a Twitter application to generate the Twitter API keys, access token and secret keys.
OAuth is an open protocol that allows secure authorization in a simple and standard way from web, mobile and desktop applications (https://dev.twitter.com/oauth).
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript programming language.
Making connection (cont…)
You must have a Twitter account. To get Twitter access keys, you need to create a Twitter application, which is mandatory for accessing the Twitter API.
Go to https://apps.twitter.com and log in, if necessary.
Then create an application: enter your application name, description and website address. You can leave the callback URL empty.
Making connection (cont…) A new application is created as shown below.
Making connection (cont…)
Change the permissions if necessary, depending on whether the application needs read-only access, read and write access, or read, write and direct-message access.
Making connection (cont…)
Check the option "Allow this application to be used to Sign in with Twitter". Submit the form by clicking Update Settings.
Making connection (cont…) Generate access token.
Making connection (cont…)
Copy the consumer key (API key) and consumer secret from the screen into your application.
Example (do not use these; they are not valid):
Consumer key: “96KlLEJaRtrEXo3uvFBRMvFu2”
Consumer secret: “NUzR0iJOgMQXWzaNeOZbXD5wHnmBslnRjcRX3my6xK3ryDCLOP”
Twitter Authentication using R
library(httr)
library(twitteR)
library(RCurl)
# Placeholder values only: replace them with your own keys
consumer_key <- 'gg34l45ll5m'
consumer_secret <- 'kffknfkndnknknfnkfnknf'
access_token <- 'ffnfnfnfnehwe9r'
access_secret <- 'idiofhiehfiehfhe'
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
Do not use these keys; put in your own keys.
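Since the keys must not be shared, one option is to keep them out of the script and read them from environment variables. The sketch below uses only base R; the variable names (TWITTER_CONSUMER_KEY and so on) and the helper read_keys() are hypothetical and not part of twitteR, and the Sys.setenv() call is there only so the example runs on its own.

```r
# Hypothetical environment variable names; in practice they would be set
# outside the script (for example in ~/.Renviron), not in the code.
key_vars <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
              "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

# Hypothetical helper: read the variables and fail early if any is missing.
read_keys <- function(vars) {
  vals <- Sys.getenv(vars, unset = NA)
  missing_vars <- vars[is.na(vals)]
  if (length(missing_vars) > 0) {
    stop("Missing environment variables: ",
         paste(missing_vars, collapse = ", "))
  }
  vals
}

# For demonstration only: set placeholder values so the sketch is runnable.
Sys.setenv(TWITTER_CONSUMER_KEY = "placeholder-key",
           TWITTER_CONSUMER_SECRET = "placeholder-secret",
           TWITTER_ACCESS_TOKEN = "placeholder-token",
           TWITTER_ACCESS_SECRET = "placeholder-token-secret")

keys <- read_keys(key_vars)

# The values could then be passed on, for example:
# setup_twitter_oauth(keys[[1]], keys[[2]], keys[[3]], keys[[4]])
```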
Proxy Setting in R
Use the httr package function:
set_config(use_proxy(url='192.168.15.5', port=8000, username=NULL, password=NULL))
Fetching Tweets/Data from Twitter
searchTwitter()
For more information on the different functions: http://cran.r-project.org/web/packages/twitteR/twitteR.pdf
tweets <- searchTwitter("iphone 6", n=500, lang="en")
## the number of tweets returned may be less than 500
df <- twListToDF(tweets)
iphone <- unique(as.character(df$text))
## after removing duplicate tweets, there are (###) tweets in this example
iphone <- gsub("\n", " ", iphone, fixed=TRUE)
## you can save the data in a file
# name <- file(description="F:/sample.txt", open="w")
name <- file(description="sample.txt", open="w")  # will be saved in the working directory
write(x=iphone, file=name)
close(con=name)
Note on language codes: in <html lang="en"> the lang attribute specifies only a language code; in <html lang="en-US"> it specifies a language code followed by a country code.
For gsub, see http://www.endmemo.com/program/R/gsub.php
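The de-duplication and newline removal above can be tried without a Twitter connection. The following base-R sketch uses a few made-up tweet texts in place of df$text and a temporary file in place of sample.txt; it illustrates why newlines are replaced before saving: each tweet then occupies exactly one line of the file.

```r
# Made-up texts standing in for df$text (assumption: not real tweets)
raw <- c("Love the new phone\nso much",
         "Love the new phone\nso much",
         "Battery could be better")

clean <- unique(raw)                           # drop duplicate tweets
clean <- gsub("\n", " ", clean, fixed = TRUE)  # one tweet per line

# Write to a temporary file and read it back, one tweet per line
tmp <- tempfile(fileext = ".txt")
writeLines(clean, tmp)
readBack <- readLines(tmp)

print(length(readBack))
print(readBack)
```

If the embedded newline were left in place, the first tweet would be split across two lines of the file and read back as two separate documents.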
Anatomy of a Tweet
1. Profile picture
Source: https://media.twitter.com/best-practice/anatomy-of-a-tweet
Text Mining on Tweets (Pre-processing step first)
Pre-processing of the tweets is done first.
library(tm)
library(stringr)
tweets <- scan("F:/sample.txt", what=character(0), sep="\n")
# replace HTML entities (special symbols in HTML) with a space
tweets <- gsub("&amp;", " ", tweets, fixed=TRUE)
tweets <- gsub("&gt;", " ", tweets, fixed=TRUE)
tweets <- gsub("&lt;", " ", tweets, fixed=TRUE)
# replace newline characters and literal "\n" sequences with a space
tweets <- gsub("\n", " ", tweets, fixed=TRUE)
tweets <- gsub("\\n", " ", tweets, fixed=TRUE)
tweets <- gsub("_", "", tweets, fixed=TRUE)
# remove numerical values (a regular expression, so fixed=TRUE is not used)
tweets <- gsub("[0-9]+", "", tweets)
tweets <- gsub("'", "", tweets, fixed=TRUE)
tweets <- tolower(tweets)
Notes: The first three substitutions handle special symbols in HTML (HTML Unicode and ASCII entities). The "[0-9]+" pattern removes numerical values. For a regular expression, do not set fixed=TRUE; with fixed=TRUE the pattern is substituted exactly as written, and the match is case-sensitive.
In theoretical computer science and formal language theory, a regular expression (abbreviated regex or regexp, and sometimes called a rational expression) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed, an editor, and grep (global regular expression print), a filter.
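The difference between a literal substitution (fixed=TRUE) and a regular-expression substitution can be checked on small strings. This is a minimal base-R sketch; the input strings are made up for illustration.

```r
# Literal substitution: "." means the dot character itself
literalDot <- gsub(".", "-", "a.b.c", fixed = TRUE)   # "a-b-c"

# Regular expression: "." matches any single character
regexDot <- gsub(".", "-", "a.b.c")                   # "-----"

# "[0-9]+" is a regular expression for one or more digits
noDigits <- gsub("[0-9]+", "", "iphone 6 costs 649")

# With fixed = TRUE the same pattern is searched for literally,
# so nothing is replaced
unchanged <- gsub("[0-9]+", "", "iphone 6 costs 649", fixed = TRUE)

# Literal matching is case-sensitive; a regular expression can ignore case
caseFixed <- gsub("iphone", "", "My iPhone", fixed = TRUE)
caseRegex <- gsub("iphone", "", "My iPhone", ignore.case = TRUE)

print(c(literalDot, regexDot, noDigits, unchanged, caseFixed, caseRegex))
```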
Text Mining on Tweets (Pre-processing step first)
# remove usernames (@username) with a regular expression
reg <- "(^|[^@\\w])@(\\w{1,35})\\b"
tweets <- gsub(reg, "", tweets)
# allow no more than 2 consecutive identical characters in a word
reg <- c("aaa(a)*","bbb(b)*","ccc(c)*","ddd(d)*","eee(e)*","fff(f)*","ggg(g)*","hhh(h)*","iii(i)*","jjj(j)*","kkk(k)*","lll(l)*","mmm(m)*","nnn(n)*",
"ooo(o)*","ppp(p)*","qqq(q)*","rrr(r)*","sss(s)*","ttt(t)*","uuu(u)*","vvv(v)*","www(w)*","xxx(x)*","yyy(y)*","zzz(z)*")
rep <- c("aa","bb","cc","dd","ee","ff","gg","hh","ii","jj","kk","ll","mm","nn","oo","pp","qq","rr","ss","tt","uu","vv","ww","xx","yy","zz")
for(i in 1:26) {
  tweets <- str_replace_all(tweets, reg[i], rep[i])
}
myCorpus <- Corpus(VectorSource(tweets))
myCorpus <- tm_map(myCorpus, removePunctuation)
# punctuation is already removed at this point, so a URL is a run of letters and digits starting with "http"
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
# custom functions are wrapped in content_transformer() (tm 0.6 and later)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
# stemming reduces each word to its stem (root word)
myCorpus <- tm_map(myCorpus, stemDocument)
# may be unnecessary, depending on the tm version
myCorpus <- tm_map(myCorpus, PlainTextDocument)
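The username pattern and the "no more than 2 consecutive characters" rule can be tried on sample strings with base R alone. In the sketch below, the 26-pattern loop is swapped for a single back-reference pattern as an alternative; this is not the method used above, only an equivalent idea for lowercase letters. perl=TRUE is passed as a choice made for this sketch so that the PCRE engine interprets \\w inside the bracket expression; the input strings are made up.

```r
# Username pattern from the slides
userReg <- "(^|[^@\\w])@(\\w{1,35})\\b"

# The mention is removed together with the character before it
noUser <- gsub(userReg, "", "thanks @john_doe for the tip", perl = TRUE)

# An e-mail address is left alone: its "@" follows a word character
keptMail <- gsub(userReg, "", "mail me at a@b.com", perl = TRUE)

# Alternative to the 26 patterns: capture one lowercase letter and match
# two or more further copies of it, then keep exactly two
collapseRepeats <- function(x) {
  gsub("([a-z])\\1{2,}", "\\1\\1", x, perl = TRUE)
}

collapsed <- collapseRepeats("sooooo coool")

print(c(noUser, keptMail, collapsed))
```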
Term-document matrix generation, Frequency analysis and word-cloud generation
library(vegan)
# keep only terms with at least 3 characters
myTdm <- TermDocumentMatrix(myCorpus, control=list(wordLengths=c(3,Inf)))
minfreq <- 20
findFreqTerms(myTdm, lowfreq=minfreq)
termFrequency <- rowSums(as.matrix(myTdm))
termFrequency <- subset(termFrequency, termFrequency>=minfreq)
library(ggplot2)
# bar chart of the frequent terms (ggplot() form of qplot(..., stat="identity", geom="bar"))
termDf <- data.frame(term=names(termFrequency), freq=termFrequency)
ggplot(termDf, aes(x=term, y=freq)) + geom_bar(stat="identity") + xlab("Terms") + coord_flip()
library(wordcloud)
m <- as.matrix(myTdm)
wordFreq <- sort(rowSums(m), decreasing=TRUE)
v <- which(wordFreq>5)
wordFreq <- wordFreq[v]
# more frequent words get a lighter gray level
grayLevels <- gray( (wordFreq+10) / (max(wordFreq)+10) )
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=3, random.order=FALSE, colors=grayLevels)
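To make the structure of a term-document matrix concrete, here is a small base-R sketch that builds one by hand for three made-up documents and then computes term frequencies with rowSums, as in the code above. It does not use tm; the documents and the threshold of 2 are assumptions for illustration.

```r
# Three made-up, already pre-processed documents
docs <- c("iphone screen great", "iphone battery bad", "great battery")

# Split each document into terms
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are counts
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
rownames(tdm) <- terms
colnames(tdm) <- paste0("doc", seq_along(docs))
print(tdm)

# Term frequency is the row sum; keep terms at or above a threshold
termFrequency <- rowSums(tdm)
frequentTerms <- termFrequency[termFrequency >= 2]
print(frequentTerms)
```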
Term-document matrix generation, Frequency analysis and word-cloud generation
A word cloud is a pictorial representation of the frequency of different words. In this example, the word “iphon” (the stemmed version of “iphone”) has the highest frequency. The size of a word in the word cloud is proportional to its frequency in the data.
Clustering of words
The total number of documents (tweets) in the dataset is 485.
We use the hierarchical clustering algorithm with the Ward linkage method.
The number of clusters is taken as 3 (shown as red blocks in the figure).
myTdm2 <- removeSparseTerms(myTdm, sparse=0.95)
## sparse terms are removed: terms that are absent from more than 95% of the documents
m2 <- as.matrix(myTdm2)
distMatrix <- dist(scale(m2))
# Ward linkage; this method is named "ward" in R versions before 3.1.0
fit <- hclust(distMatrix, method="ward.D")
plot(fit)
rect.hclust(fit, k=3)
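The clustering step can be illustrated on a tiny hand-made term-document matrix using only base R. In this sketch the four terms, their counts and the choice of k=2 are assumptions for illustration; cutree() is used here to read off the cluster membership that rect.hclust() draws on the dendrogram.

```r
# Made-up term-document counts: rows are terms, columns are 6 documents
m <- rbind(apple = c(5, 4, 5, 0, 0, 0),
           phone = c(4, 5, 4, 0, 1, 0),
           price = c(0, 0, 1, 5, 4, 5),
           cheap = c(0, 1, 0, 4, 5, 4))

# Same steps as above: scale, compute distances, cluster with Ward linkage
distMatrix <- dist(scale(m))
fit <- hclust(distMatrix, method = "ward.D")

# Cluster membership of each term for k = 2 clusters
groups <- cutree(fit, k = 2)
print(groups)
```

Terms that occur in the same documents ("apple" and "phone", "price" and "cheap") end up in the same cluster.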