Text Mining: Twitter



Agenda
- Making connection
- Fetching tweets/data
- Mining of tweets

Making connection
- The Twitter API has required authentication since March 2013. Twitter has closed access to version 1.0 of the API; the current version (to date) is 1.1.
- You have to create a Twitter application to generate the Twitter API keys: the consumer key, consumer secret, access token, and access token secret.
- OAuth: an open protocol to allow secure authorization in a simple and standard method from web, mobile, and desktop applications (https://dev.twitter.com/oauth).
- JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript programming language.
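
Since the API returns JSON, it helps to see how a tweet payload maps onto R objects. A minimal sketch using the jsonlite package (not part of these slides' own toolchain); the field names id_str, text, and user.screen_name are genuine v1.1 tweet fields, but the values here are invented:

```r
# Parse a hand-written JSON string shaped like a (heavily trimmed) v1.1 tweet.
library(jsonlite)

raw <- '{"id_str": "123", "text": "hello world", "user": {"screen_name": "example"}}'
tweet <- fromJSON(raw)

tweet$text              # "hello world"
tweet$user$screen_name  # "example"
```

fromJSON() turns JSON objects into named lists, so nested fields such as user are reached with the usual $ accessors.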

Making connection (cont…) You must have a Twitter account. To get the Twitter access keys, you need to create a Twitter application; this is mandatory for accessing the API. Go to https://apps.twitter.com and log in, if necessary. Then create an application: enter your application name, description, and website address. You can leave the callback URL empty.

Making connection (cont…) A new application is now created.

Making connection (cont…) Change the permissions if necessary (depending on whether you want the application to only read, also write, or also access direct messages).

Making connection (cont…) Check "Allow this application to be used to Sign in with Twitter", then submit the form by clicking Update Settings.

Making connection (cont…) Generate access token.

Making connection (cont…) Copy the consumer key (API key) and consumer secret from the screen into your application. Example (do not use these keys; they are not valid):
Consumer key: "96KlLEJaRtrEXo3uvFBRMvFu2"
Consumer secret: "NUzR0iJOgMQXWzaNeOZbXD5wHnmBslnRjcRX3my6xK3ryDCLOP"

Twitter authentication in R (twitteR package)

library(twitteR)
library(RCurl)
consumer_key    <- 'gg34l45ll5m'
consumer_secret <- 'kffknfkndnknknfnkfnknf'
access_token    <- 'ffnfnfnfnehwe9r'
access_secret   <- 'idiofhiehfiehfhe'
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Do not use these keys; put in your own.

Proxy setting in R (httr package)

set_config(use_proxy(url='192.168.15.5', 8000, NULL, NULL))

Fetching tweets/data from Twitter

searchTwitter() is the main function; for more information on the package functions see http://cran.r-project.org/web/packages/twitteR/twitteR.pdf

tweets <- searchTwitter("iphone 6", n=500, lang="en")  ## number of tweets returned may be less than 500
df <- twListToDF(tweets)
iphone <- unique(as.character(df$text))  ## after removing duplicate tweets, there are (###) tweets in this example
iphone <- gsub("\n", " ", iphone, fixed=TRUE)
## you can save the data in a file
# name <- file(description="F:/sample.txt", open="w")
name <- file(description="sample.txt", open="w")  ## will be saved in the working directory
write(x=iphone, file=name)
close(con=name)

Note on the lang argument: <html lang="en"> specifies only a language code, while <html lang="en-US"> specifies a language code followed by a country code. See http://www.endmemo.com/program/R/gsub.php for gsub.
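
The file(), write(), close() pattern above can be checked offline: a character vector written one element per line should round-trip through scan(), which the pre-processing slide later uses to read sample.txt back. A minimal sketch against a temporary file (the tweet texts are invented):

```r
# Round-trip a vector of (fake) tweet texts through a text file,
# one tweet per line, mirroring the write()/scan() pattern in the slides.
tweets_out <- c("first tweet", "second tweet", "third tweet")

path <- tempfile(fileext = ".txt")
con <- file(description = path, open = "w")
write(x = tweets_out, file = con)   # character input is written one element per line
close(con = con)

tweets_in <- scan(path, what = character(0), sep = "\n")
length(tweets_in)  # 3
```

write() uses one column per line for character vectors, which is exactly what scan(..., sep="\n") expects on the way back in.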

Anatomy of a tweet: labelled figure (1: profile pic); see https://media.twitter.com/best-practice/anatomy-of-a-tweet

Text mining on tweets (pre-processing step first) Pre-processing of the tweets is done first.

library(tm)
library(stringr)
tweets <- scan("F:/sample.txt", what=character(0), sep="\n")
tweets <- gsub("&", " ", tweets, fixed=TRUE)  ## strip HTML special symbols
tweets <- gsub(">", " ", tweets, fixed=TRUE)
tweets <- gsub("<", " ", tweets, fixed=TRUE)
tweets <- gsub("\n", " ", tweets, fixed=TRUE)
tweets <- gsub("\\n", " ", tweets, fixed=TRUE)
tweets <- gsub("_", "", tweets, fixed=TRUE)
tweets <- gsub("[0-9]+", "", tweets)          ## regular expression, so no fixed=TRUE
tweets <- gsub("'", "", tweets, fixed=TRUE)
tweets <- tolower(tweets)                     ## upper case to lower case

Notes: the & < > symbols arrive as HTML special symbols (Unicode/ASCII numerical entities) in tweet text. fixed=TRUE substitutes the pattern exactly as written; for a regular expression such as "[0-9]+", do not set fixed=TRUE.

In theoretical computer science and formal language theory, a regular expression (abbreviated regex or regexp, and sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text-processing utilities ed, an editor, and grep (global regular expression print), a filter.
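
The reason fixed=TRUE is dropped for the "[0-9]+" line is that the pattern must be interpreted as a regular expression; a quick sketch of the difference (the sample string is invented):

```r
x <- "iphone 6 costs 650"

# fixed = TRUE: the pattern is taken literally, and the literal text
# "[0-9]+" never occurs in x, so nothing is replaced.
gsub("[0-9]+", "", x, fixed = TRUE)   # unchanged

# default: the pattern is a regular expression, so every run of digits
# is removed.
gsub("[0-9]+", "", x)                 # "iphone  costs "
```

The same logic explains why the earlier lines (plain symbols like "&" and "_") do use fixed=TRUE: taken as regexes, some of those characters would be misread as metacharacters.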

Text Mining on Tweets (Pre-processing step first) reg <- "(^|[^@\\w])@(\\w{1,35})\\b" tweets <- gsub(reg,"",tweets) reg <- c("aaa(a)*","bbb(b)*","ccc(c)*","ddd(d)*","eee(e)*","fff(f)*","ggg(g)*","hhh(h)*","iii(i)*","jjj(j)*","kkk(k)*","lll(l)*","mmm(m)*","nnn( n)*", "ooo(o)*","ppp(p)*","qqq(q)*","rrr(r)*","sss(s)*","ttt(t)*","uuu(u)*","vvv(v)*","www(w)*","xxx(x)*","yyy(y)*","zzz(z)*") rep <- c("aa","bb","cc","dd","ee","ff","gg","hh","ii","jj","kk","ll","mm","nn","oo","pp","qq","rr","ss","tt","uu","vv","ww","xx","yy","zz") for(i in 1:26) { tweets <- str_replace_all(tweets,reg[i],rep[i]) } myCorpus <- Corpus(VectorSource(tweets)) myCorpus <- tm_map(myCorpus, removePunctuation) removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) myCorpus <- tm_map(myCorpus, removeURL) myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) myCorpus <- tm_map(myCorpus, stemDocument) myCorpus <- tm_map(myCorpus, PlainTextDocument) To remove username (regular expression) @username No more than 2 consecutive character in the word. Stem – root word

Term-document matrix generation, frequency analysis and word-cloud generation

library(vegan)
myTdm <- TermDocumentMatrix(myCorpus, control=list(wordLengths=c(3, Inf)))
minfreq <- 20
findFreqTerms(myTdm, lowfreq=minfreq)
termFrequency <- rowSums(as.matrix(myTdm))
termFrequency <- subset(termFrequency, termFrequency >= minfreq)

library(ggplot2)
qplot(names(termFrequency), termFrequency, geom="bar", stat="identity", xlab="Terms") + coord_flip()

library(wordcloud)
m <- as.matrix(myTdm)
wordFreq <- sort(rowSums(m), decreasing=TRUE)
v <- which(wordFreq > 5)
wordFreq <- wordFreq[v]
grayLevels <- gray((wordFreq + 10) / (max(wordFreq) + 10))
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=3, random.order=FALSE, colors=grayLevels)
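
The rowSums() step that yields termFrequency is simply counting how often each term occurs across all documents, which can be mimicked in base R without tm (the three toy "tweets" below are invented):

```r
# Toy corpus of already-cleaned documents.
docs <- c("iphone is great", "iphone battery is weak", "great battery")

# Tokenize on spaces and count each term across the whole corpus,
# mirroring rowSums(as.matrix(myTdm)).
words <- unlist(strsplit(docs, " "))
termFrequency <- sort(table(words), decreasing = TRUE)
termFrequency["iphone"]  # 2
```

A TermDocumentMatrix keeps the per-document counts as well (one column per document); the base-R table() above only reproduces the row sums that the frequency plot and word cloud actually use.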

Term-document matrix generation, frequency analysis and word-cloud generation A word cloud is a pictorial representation of the frequencies of different words. In this example, the word "iphon" (the stemmed form of "iphone") has the highest frequency. The size of a word in the word cloud is proportional to its frequency in the data.

Clustering of words
- The total number of documents (tweets) in the dataset is 485.
- We use a hierarchical clustering algorithm with the Ward linkage method.
- The number of clusters is taken as 3 (shown as red blocks in the figure).

myTdm2 <- removeSparseTerms(myTdm, sparse=0.95)  ## terms missing from more than 95% of the documents are removed (sparse terms)
m2 <- as.matrix(myTdm2)
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method="ward")  ## newer versions of R spell this method="ward.D"
plot(fit)
rect.hclust(fit, k=3)
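
The same dist()/hclust()/cut pipeline can be tried on a synthetic matrix without any Twitter data; note that recent versions of R spell the Ward linkage method "ward.D" (the rows below are randomly generated points standing in for term vectors, not real data):

```r
# Three well-separated groups of 5 two-dimensional points each.
set.seed(1)
m <- rbind(matrix(rnorm(10, mean = 0),  ncol = 2),
           matrix(rnorm(10, mean = 5),  ncol = 2),
           matrix(rnorm(10, mean = 10), ncol = 2))

distMatrix <- dist(scale(m))
fit <- hclust(distMatrix, method = "ward.D")  # older R accepted method = "ward"

# cutree() returns the cluster labels that rect.hclust(fit, k = 3) draws.
clusters <- cutree(fit, k = 3)
table(clusters)
```

With groups this well separated, cutting the dendrogram at k = 3 recovers the three planted clusters of 5 points each.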