21 Recipes for Mining Twitter [Social Network Analysis], March 2013, Hoon-Young Jung


1.5 Extracting a Retweet's Origins

Problem: You want to extract the originating source from a retweet.

Solution: If the tweet's retweet_count field is greater than 0, extract the name out of the tweet's user field; also parse the text of the tweet with a regular expression.
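For context, a "legacy" retweet embeds its origin directly in the tweet text, as "RT @user ..." or "... via @user". A minimal sketch of what the regular expression compiled in the example below pulls out of such text:

import re

# Matches legacy retweet conventions such as "RT @user" and "via @user";
# the second group captures the run of @-mentions following the keyword.
rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

sample = 'RT @ptwobrussell Get the example code #w00t'
print rt_patterns.findall(sample)
# [('RT', ' @ptwobrussell')]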

1.5 Extracting a Retweet's Origins  Example: Extracting retweet origins

import re

def get_rt_origins(tweet):

    # Regex adapted from a Stack Overflow answer on matching retweets
    # ("python-regular-expression-for-retweets")
    rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

    rt_origins = []

    # Inspect the tweet to see if it was produced with /statuses/retweet/:id
    if tweet.has_key('retweet_count'):
        if tweet['retweet_count'] > 0:
            rt_origins += [ tweet['user']['name'].lower() ]

    # Also, inspect the tweet for the presence of "legacy" retweet
    # patterns such as "RT" and "via"
    try:
        rt_origins += [
            mention.strip()
            for mention in rt_patterns.findall(tweet['text'])[0][1].split()
        ]
    except IndexError, e:
        pass

    # Filter out any duplicates, normalizing each origin by stripping
    # the leading "@" and lowercasing
    return list(set([rto.strip('@').lower() for rto in rt_origins]))


1.5 Extracting a Retweet's Origins  Example: Extracting retweet origins

if __name__ == '__main__':

    # A mocked up array of tweets for purposes of illustration; the tweet
    # text below is a placeholder. Assume tweets have been fetched from
    # the /search resource or elsewhere.
    tweets = \
        [
            {
                'text' : 'RT @ptwobrussell Get the example code #w00t'
                # ... more tweet fields ...
            },
            {
                'text' : 'Get the example code #w00t',
                'retweet_count' : 1,
                'user' : {
                    'name' : 'ptwobrussell'
                    # ... more user fields ...
                }
                # ... more tweet fields ...
            }
            # ... more tweets ...
        ]

    for tweet in tweets:
        print get_rt_origins(tweet)
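With these mocked tweets, both calls print ['ptwobrussell']: the first tweet matches the legacy "RT @..." pattern in its text, while the second has a retweet_count greater than 0, so the origin is read from its user name field.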

1.6 Creating a Graph of Retweet Relationships

Problem: You want to construct and analyze a graph data structure of retweet relationships for a set of query results.

Solution: Query for the topic, extract the retweet origins, and then use the NetworkX package to construct a graph to analyze.

1.6 Creating a Graph of Retweet Relationships  Example: Creating a graph of retweet relationships

# -*- coding: utf-8 -*-

import sys
import json
import twitter
import networkx as nx

from recipe__get_rt_origins import get_rt_origins

def create_rt_graph(tweets):

    g = nx.DiGraph()

    for tweet in tweets:

        rt_origins = get_rt_origins(tweet)

        if not rt_origins:
            continue

        # Add an edge from each retweet origin to the user who retweeted,
        # annotated with the id of the tweet that relates them
        for rt_origin in rt_origins:
            g.add_edge(rt_origin.encode('ascii', 'ignore'),
                       tweet['from_user'].encode('ascii', 'ignore'),
                       {'tweet_id': tweet['id']}
                      )
    return g
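A quick sanity check of create_rt_graph on a hand-built tweet (the from_user and id values are made up for illustration, and the dict-style edge attributes assume the NetworkX 1.x API these slides target):

mock_tweets = [ {'text': 'RT @alice hello world', 'from_user': 'bob', 'id': 1} ]

g = create_rt_graph(mock_tweets)
print g.edges(data=True)
# [('alice', 'bob', {'tweet_id': 1})]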

1.6 Creating a Graph of Retweet Relationships  Example: Creating a graph of retweet relationships

if __name__ == '__main__':

    # Your query
    Q = ' '.join(sys.argv[1:])

    # How many pages of data to grab for the search results
    MAX_PAGES = 15

    # How many search results per page
    RESULTS_PER_PAGE = 100

    # Get some search results for a query
    twitter_search = twitter.Twitter(domain='search.twitter.com')

    search_results = []

    for page in range(1, MAX_PAGES + 1):
        search_results.append(
            twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
            # tweepy.api.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
        )
        # result_list = tweepy.api.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
        # search_results.extend(result_list)

    all_tweets = [tweet for page in search_results for tweet in page['results']]

    # Build up a graph data structure
    g = create_rt_graph(all_tweets)

    # Print out some stats
    print >> sys.stderr, "Number of nodes:", g.number_of_nodes()
    print >> sys.stderr, "Number of edges:", g.number_of_edges()
    print >> sys.stderr, "Number of connected components:", \
        len(nx.connected_components(g.to_undirected()))
    print >> sys.stderr, "Node degrees:", sorted(nx.degree(g))
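Because each edge points from a retweet origin to the user who retweeted it, a node's out-degree approximates how widely that user is being retweeted. A minimal sketch of ranking users this way, again assuming the NetworkX 1.x dict-returning out_degree API:

# Rank users by out-degree, i.e. by how many distinct users retweeted them
top_origins = sorted(g.out_degree().items(),
                     key=lambda item: item[1],
                     reverse=True)[:10]
for screen_name, degree in top_origins:
    print >> sys.stderr, screen_name, degree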
