Text Mining: Twitter



Agenda
- Making connection
- Fetching tweets/data
- Mining of tweets

Making connection
- The Twitter API has required authentication since March 2013. Twitter has closed access to version 1.0 of the API; the current version (to date) is 1.1.
- You have to create a Twitter application to generate the Twitter API keys: the consumer key, consumer secret, access token, and access token secret.
- OAuth: an open protocol to allow secure authorization in a simple and standard method from web, mobile, and desktop applications (https://dev.twitter.com/oauth).
- JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript programming language.
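
Since the API returns JSON, it helps to see how a tweet payload maps onto R objects. A minimal sketch using the jsonlite package (not part of these slides' own toolchain); the field names id_str, text, and user.screen_name are genuine v1.1 tweet fields, but the values here are invented:

```r
# Parse a hand-written JSON string shaped like a (heavily trimmed) v1.1 tweet.
library(jsonlite)

raw <- '{"id_str": "123", "text": "hello world", "user": {"screen_name": "example"}}'
tweet <- fromJSON(raw)

tweet$text              # "hello world"
tweet$user$screen_name  # "example"
```

fromJSON() turns JSON objects into named lists, so nested fields such as user are reached with the usual $ accessors.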

Making connection (cont…) You must have a Twitter account. To get the Twitter access keys, you need to create a Twitter application; this is mandatory for accessing the API. Go to https://apps.twitter.com and log in, if necessary. Then create an application: enter your application name, description, and website address. You can leave the callback URL empty.

Making connection (cont…) A new application is now created.

Making connection (cont…) Change the permissions if necessary (depending on whether you want the application to only read, also write, or also access direct messages).

Making connection (cont…) Check "Allow this application to be used to Sign in with Twitter", then submit the form by clicking Update Settings.

Making connection (cont…) Generate access token.

Making connection (cont…) Copy the consumer key (API key) and consumer secret from the screen into your application. Example (do not use these keys; they are not valid):
Consumer key: "96KlLEJaRtrEXo3uvFBRMvFu2"
Consumer secret: "NUzR0iJOgMQXWzaNeOZbXD5wHnmBslnRjcRX3my6xK3ryDCLOP"

Twitter authentication in R (twitteR package)

library(twitteR)
library(RCurl)
consumer_key    <- 'gg34l45ll5m'
consumer_secret <- 'kffknfkndnknknfnkfnknf'
access_token    <- 'ffnfnfnfnehwe9r'
access_secret   <- 'idiofhiehfiehfhe'
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Do not use these keys; put in your own.

Proxy setting in R (httr package)

set_config(use_proxy(url='192.168.15.5', 8000, NULL, NULL))

Fetching tweets/data from Twitter

searchTwitter() is the main function; for more information on the package functions see http://cran.r-project.org/web/packages/twitteR/twitteR.pdf

tweets <- searchTwitter("iphone 6", n=500, lang="en")  ## number of tweets returned may be less than 500
df <- twListToDF(tweets)
iphone <- unique(as.character(df$text))  ## after removing duplicate tweets, there are (###) tweets in this example
iphone <- gsub("\n", " ", iphone, fixed=TRUE)
## you can save the data in a file
# name <- file(description="F:/sample.txt", open="w")
name <- file(description="sample.txt", open="w")  ## will be saved in the working directory
write(x=iphone, file=name)
close(con=name)

Note on the lang argument: <html lang="en"> specifies only a language code, while <html lang="en-US"> specifies a language code followed by a country code. See http://www.endmemo.com/program/R/gsub.php for gsub.
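
The file(), write(), close() pattern above can be checked offline: a character vector written one element per line should round-trip through scan(), which the pre-processing slide later uses to read sample.txt back. A minimal sketch against a temporary file (the tweet texts are invented):

```r
# Round-trip a vector of (fake) tweet texts through a text file,
# one tweet per line, mirroring the write()/scan() pattern in the slides.
tweets_out <- c("first tweet", "second tweet", "third tweet")

path <- tempfile(fileext = ".txt")
con <- file(description = path, open = "w")
write(x = tweets_out, file = con)   # character input is written one element per line
close(con = con)

tweets_in <- scan(path, what = character(0), sep = "\n")
length(tweets_in)  # 3
```

write() uses one column per line for character vectors, which is exactly what scan(..., sep="\n") expects on the way back in.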

Anatomy of a tweet: labelled figure (1: profile pic); see https://media.twitter.com/best-practice/anatomy-of-a-tweet

Text mining on tweets (pre-processing step first) Pre-processing of the tweets is done first.

library(tm)
library(stringr)
tweets <- scan("F:/sample.txt", what=character(0), sep="\n")
tweets <- gsub("&", " ", tweets, fixed=TRUE)  ## strip HTML special symbols
tweets <- gsub(">", " ", tweets, fixed=TRUE)
tweets <- gsub("<", " ", tweets, fixed=TRUE)
tweets <- gsub("\n", " ", tweets, fixed=TRUE)
tweets <- gsub("\\n", " ", tweets, fixed=TRUE)
tweets <- gsub("_", "", tweets, fixed=TRUE)
tweets <- gsub("[0-9]+", "", tweets)          ## regular expression, so no fixed=TRUE
tweets <- gsub("'", "", tweets, fixed=TRUE)
tweets <- tolower(tweets)                     ## upper case to lower case

Notes: the & < > symbols arrive as HTML special symbols (Unicode/ASCII numerical entities) in tweet text. fixed=TRUE substitutes the pattern exactly as written; for a regular expression such as "[0-9]+", do not set fixed=TRUE.

In theoretical computer science and formal language theory, a regular expression (abbreviated regex or regexp, and sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text-processing utilities ed, an editor, and grep (global regular expression print), a filter.
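
The reason fixed=TRUE is dropped for the "[0-9]+" line is that the pattern must be interpreted as a regular expression; a quick sketch of the difference (the sample string is invented):

```r
x <- "iphone 6 costs 650"

# fixed = TRUE: the pattern is taken literally, and the literal text
# "[0-9]+" never occurs in x, so nothing is replaced.
gsub("[0-9]+", "", x, fixed = TRUE)   # unchanged

# default: the pattern is a regular expression, so every run of digits
# is removed.
gsub("[0-9]+", "", x)                 # "iphone  costs "
```

The same logic explains why the earlier lines (plain symbols like "&" and "_") do use fixed=TRUE: taken as regexes, some of those characters would be misread as metacharacters.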

Text Mining on Tweets (Pre-processing step first) reg <- "(^|[^@\\w])@(\\w{1,35})\\b" tweets <- gsub(reg,"",tweets) reg <- c("aaa(a)*","bbb(b)*","ccc(c)*","ddd(d)*","eee(e)*","fff(f)*","ggg(g)*","hhh(h)*","iii(i)*","jjj(j)*","kkk(k)*","lll(l)*","mmm(m)*","nnn( n)*", "ooo(o)*","ppp(p)*","qqq(q)*","rrr(r)*","sss(s)*","ttt(t)*","uuu(u)*","vvv(v)*","www(w)*","xxx(x)*","yyy(y)*","zzz(z)*") rep <- c("aa","bb","cc","dd","ee","ff","gg","hh","ii","jj","kk","ll","mm","nn","oo","pp","qq","rr","ss","tt","uu","vv","ww","xx","yy","zz") for(i in 1:26) { tweets <- str_replace_all(tweets,reg[i],rep[i]) } myCorpus <- Corpus(VectorSource(tweets)) myCorpus <- tm_map(myCorpus, removePunctuation) removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) myCorpus <- tm_map(myCorpus, removeURL) myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) myCorpus <- tm_map(myCorpus, stemDocument) myCorpus <- tm_map(myCorpus, PlainTextDocument) To remove username (regular expression) @username No more than 2 consecutive character in the word. Stem – root word

Term-document matrix generation, frequency analysis and word-cloud generation

library(vegan)
myTdm <- TermDocumentMatrix(myCorpus, control=list(wordLengths=c(3, Inf)))
minfreq <- 20
findFreqTerms(myTdm, lowfreq=minfreq)
termFrequency <- rowSums(as.matrix(myTdm))
termFrequency <- subset(termFrequency, termFrequency >= minfreq)

library(ggplot2)
qplot(names(termFrequency), termFrequency, geom="bar", stat="identity", xlab="Terms") + coord_flip()

library(wordcloud)
m <- as.matrix(myTdm)
wordFreq <- sort(rowSums(m), decreasing=TRUE)
v <- which(wordFreq > 5)
wordFreq <- wordFreq[v]
grayLevels <- gray((wordFreq + 10) / (max(wordFreq) + 10))
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=3, random.order=FALSE, colors=grayLevels)
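
The rowSums() step that yields termFrequency is simply counting how often each term occurs across all documents, which can be mimicked in base R without tm (the three toy "tweets" below are invented):

```r
# Toy corpus of already-cleaned documents.
docs <- c("iphone is great", "iphone battery is weak", "great battery")

# Tokenize on spaces and count each term across the whole corpus,
# mirroring rowSums(as.matrix(myTdm)).
words <- unlist(strsplit(docs, " "))
termFrequency <- sort(table(words), decreasing = TRUE)
termFrequency["iphone"]  # 2
```

A TermDocumentMatrix keeps the per-document counts as well (one column per document); the base-R table() above only reproduces the row sums that the frequency plot and word cloud actually use.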

Term-document matrix generation, frequency analysis and word-cloud generation A word cloud is a pictorial representation of the frequencies of different words. In this example, the word "iphon" (the stemmed form of "iphone") has the highest frequency. The size of a word in the word cloud is proportional to its frequency in the data.

Clustering of words
- The total number of documents (tweets) in the dataset is 485.
- We use a hierarchical clustering algorithm with the Ward linkage method.
- The number of clusters is taken as 3 (shown as red blocks in the figure).

myTdm2 <- removeSparseTerms(myTdm, sparse=0.95)  ## terms missing from more than 95% of the documents are removed (sparse terms)
m2 <- as.matrix(myTdm2)
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method="ward")  ## newer versions of R spell this method="ward.D"
plot(fit)
rect.hclust(fit, k=3)
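
The same dist()/hclust()/cut pipeline can be tried on a synthetic matrix without any Twitter data; note that recent versions of R spell the Ward linkage method "ward.D" (the rows below are randomly generated points standing in for term vectors, not real data):

```r
# Three well-separated groups of 5 two-dimensional points each.
set.seed(1)
m <- rbind(matrix(rnorm(10, mean = 0),  ncol = 2),
           matrix(rnorm(10, mean = 5),  ncol = 2),
           matrix(rnorm(10, mean = 10), ncol = 2))

distMatrix <- dist(scale(m))
fit <- hclust(distMatrix, method = "ward.D")  # older R accepted method = "ward"

# cutree() returns the cluster labels that rect.hclust(fit, k = 3) draws.
clusters <- cutree(fit, k = 3)
table(clusters)
```

With groups this well separated, cutting the dendrogram at k = 3 recovers the three planted clusters of 5 points each.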