Laboratory for InterNet Computing CSCE 561 Social Media Projects Ryan Benton October 8, 2012.


1 Laboratory for InterNet Computing CSCE 561 Social Media Projects Ryan Benton October 8, 2012

2 Social Media
  – 140 million daily tweets
  – 30 billion pieces of content shared each month
  – 153 billion US SMS messages in 2009
  Sources: Facebook; Twitter; CTIA

3 Social Media Processing

4 Twitter
Tweets
  – User
    – Sender information: name, display name, location, follower and friend counts
    – Whether the tweet is directed to other users
    – If it is a retweet, who it came from
  – Tweet
    – The message
    – Hashtags
    – Date and time
    – Media information
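The fields listed on this slide could be held in a record like the one below. This is only a sketch: the class and field names (`Sender`, `Tweet`, `directed_to`, `retweeted_from`) are hypothetical and do not come from the Twitter API or the project code, and the sample values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Sender:
    name: str
    display_name: str
    location: str = ""
    follower_count: int = 0
    friend_count: int = 0


@dataclass
class Tweet:
    sender: Sender
    message: str
    created_at: datetime
    hashtags: List[str] = field(default_factory=list)
    directed_to: List[str] = field(default_factory=list)  # users the tweet is addressed to
    retweeted_from: Optional[str] = None                   # original author, if a retweet
    media: List[str] = field(default_factory=list)         # media information, e.g. URLs


# Invented sample record, for illustration only.
example = Tweet(
    sender=Sender(name="alice", display_name="Alice", location="Lafayette, LA"),
    message="A tree just flew into my house during #hurricane Isaac",
    created_at=datetime(2012, 8, 29, 6, 30),
    hashtags=["hurricane"],
)
```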

5 What are Hashtags? The # symbol, called a hashtag, is used to mark keywords or topics in a Tweet. It was created organically by Twitter users as a way to categorize messages.
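A minimal sketch of pulling hashtags out of tweet text with a regular expression. The pattern `#(\w+)` is a simplifying assumption; Twitter's own tokenization rules are more involved, and the function name `extract_hashtags` is hypothetical.

```python
import re
from typing import List


def extract_hashtags(text: str) -> List[str]:
    """Return the hashtags in a tweet, lowercased and without the leading '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]


print(extract_hashtags("#UTAustin evacuated due to #Bombthreat"))
```

Lowercasing is a design choice so that `#Hurricane` and `#hurricane` count as the same tag.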

6 Representation
Social media data can be converted into graphs.
  – Homogeneous
    – One node type
    – One link type
  – Heterogeneous
    – One or more node types
    – One or more link types
    – Requirement: either the links or the nodes (or both) must have more than one type.

7 Nodes
Nodes represent an object.
  – Examples: users, concepts, hashtags, locations
  – May have multiple attributes describing the object

8 Links
  – Relationships between nodes
  – May have more than one attribute
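The node and link description above can be sketched as a small typed graph. The class `SocialGraph` and its methods are hypothetical names chosen for illustration; the heterogeneity test follows the requirement stated on the Representation slide (more than one node type or more than one link type).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SocialGraph:
    # node_id -> {"type": ..., plus any descriptive attributes}
    nodes: Dict[str, dict] = field(default_factory=dict)
    # (source_id, target_id, {"type": ..., plus any attributes})
    links: List[Tuple[str, str, dict]] = field(default_factory=list)

    def add_node(self, node_id: str, node_type: str, **attrs) -> None:
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_link(self, source: str, target: str, link_type: str, **attrs) -> None:
        self.links.append((source, target, {"type": link_type, **attrs}))

    def is_heterogeneous(self) -> bool:
        """True if the nodes or the links (or both) have more than one type."""
        node_types = {n["type"] for n in self.nodes.values()}
        link_types = {a["type"] for _, _, a in self.links}
        return len(node_types) > 1 or len(link_types) > 1


# Invented example: a user node linked to a hashtag node.
g = SocialGraph()
g.add_node("alice", "user", location="Lafayette")
g.add_node("#hurricane", "hashtag")
g.add_link("alice", "#hurricane", "used", count=3)
print(g.is_heterogeneous())
```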

9 Visualize

10 Visualize

11 Problems
  – Identifying relationships between hashtags in Twitter data
  – Identifying (generating) important keywords from tweets

12 Identifying relationships between hashtags in Twitter Data

13 The idea
If we have a collection of normal hashtag associations, that is, hashtags that are usually used together, can we identify a developing situation by analyzing a “strange” association?

14 Research Problem
The main goal of the project is to find common associations of entities, or groups of “real world” concepts, using a graph structure of hashtags.
  1. Cluster the hashtags to form groups of entities and find inter-cluster associations.
  2. Given a collection of hashtags with frequency and user information, can we identify a change in the underlying structure from time t1 to time t2?

15 Project 1: Cluster Hashtags into Entities
  – Can we use an underlying graph structure to identify normal associations?
  – If so, can it be used to identify an association that is not normal? E.g.: #UTAustin evacuated due to #Bombthreat
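One simple way to approximate "normal associations" is to count how often pairs of hashtags co-occur in a baseline collection and then flag pairs in a new tweet that were rarely or never seen together. This is a sketch of that idea only, not the clustering approach the project asks for; the function names, the `min_count` threshold, and the sample data are all assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence_counts(tweets_hashtags: Iterable[List[str]]) -> Counter:
    """Count how often each unordered pair of hashtags appears in the same tweet."""
    counts: Counter = Counter()
    for tags in tweets_hashtags:
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts


def unusual_pairs(baseline: Counter, tags: List[str], min_count: int = 1) -> List[Tuple[str, str]]:
    """Return the pairs in one tweet that occur fewer than min_count times in the baseline."""
    return [pair for pair in combinations(sorted(set(tags)), 2)
            if baseline[pair] < min_count]


# Invented baseline of hashtag sets, one list per tweet.
baseline = cooccurrence_counts([
    ["utaustin", "longhorns"],
    ["utaustin", "longhorns", "football"],
    ["bombthreat", "airport"],
])
print(unusual_pairs(baseline, ["utaustin", "bombthreat"]))
```

The pair counts are also the edge weights of a hashtag co-occurrence graph, so the same `Counter` could feed a graph clustering step.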

16 Project 2: Analyze the transition between events
  – Suppose we have a collection of hashtags from an emergency event, e.g. a hurricane or forest fire.
  – Suppose we also have a collection of hashtags from before the event happened.
  – Can we identify the transition in the hashtags, such as changes in frequency or associations?
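A minimal sketch of one way to look at the frequency side of this question: compare each hashtag's share of all hashtag uses before and during the event. The function name `frequency_shift` and the sample data are assumptions; changes in associations would need a separate comparison of co-occurrence structure.

```python
from collections import Counter
from typing import Dict, Iterable


def frequency_shift(before: Iterable[str], after: Iterable[str]) -> Dict[str, float]:
    """Change in each hashtag's relative frequency from the 'before' window to the 'after' window."""
    b, a = Counter(before), Counter(after)
    total_b = max(sum(b.values()), 1)  # avoid division by zero on empty windows
    total_a = max(sum(a.values()), 1)
    return {tag: a[tag] / total_a - b[tag] / total_b for tag in set(b) | set(a)}


# Invented hashtag streams for the two windows.
shift = frequency_shift(
    before=["football", "football", "football", "weather"],
    after=["hurricane", "hurricane", "hurricane", "weather"],
)
for tag, delta in sorted(shift.items(), key=lambda kv: kv[1], reverse=True):
    print(tag, round(delta, 2))
```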

17 Identify (Generate) Important Keywords

18 Why?
Hashtags alone are not sufficient.
  – Example: “A tree just flew into my house during #hurricane Isaac”. Only “hurricane” is tagged; “tree”, “house”, and “Isaac” are not.

19 Employ Keyword Selection Methods to Find “Good” Keywords
  – Multiple methods exist; you can choose or research one of your own.
  – Two are discussed here:
    – “CMore Approach”
    – “Shixian Chu Approach”

20 CMore
“NSF CMORE” Filter Approach
  – Generated as part of NSF
Concept Candidate List
  – First, a candidate list is generated that contains all phrases of one, two, three, and four words.
  – Phrases are not allowed to span from one sentence to another.
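The candidate list described here can be sketched as n-gram generation within sentence boundaries. The sentence splitter (split on `.`, `!`, `?`) and the word pattern are simplifying assumptions, and `candidate_phrases` is a hypothetical name rather than part of the CMore code.

```python
import re
from typing import List


def candidate_phrases(text: str, max_len: int = 4) -> List[str]:
    """All phrases of 1..max_len consecutive words, never crossing a sentence boundary."""
    candidates = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z0-9']+", sentence.lower())
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                candidates.append(" ".join(words[i:i + n]))
    return candidates


phrases = candidate_phrases("Used Jaguar car sale. New Jaguar car.")
print(phrases)
```

Repeated phrases are kept in the list on purpose, so that counting the list gives each candidate's frequency.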

21 CMore, cont.
Filter Steps
  – Probabilistic filter: uses various concept frequencies to determine whether or not a concept is of interest.
    – The filters it uses are iterative in nature. Concepts of length one are filtered first, then concepts of length two, and so on.
    – Several functions are defined that measure the frequency of a concept relative to its prefix and suffix.
    – Utilizes thresholds: filtering rules are formed by applying certain minimum thresholds to the values of these functions.
    – Once concepts of all lengths have been processed using these rules, the remaining concepts are the relevant ones according to the probabilistic filter.

22 CMore, cont.
Filter Steps
  – Stop words filter
    – If a phrase contains a word in the stop word list, that concept is removed.
  – Entity type concepts filter
    – Concepts that do not parse to a noun phrase are discarded.
  – Commonality filter
    – Applied only to candidate concepts of length one and two words.
    – Compares the frequency with which a concept appears in a document to the frequency with which that concept appears in the Reuters [5] corpus.
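Two of the filters above are easy to sketch. The stop word filter follows the slide directly. For the commonality filter the slide only says that document frequency is compared with background-corpus frequency, so the ratio test and the `min_ratio` threshold below are assumptions, as are the function names and the toy frequencies. The noun-phrase filter needs a parser and is left out.

```python
from typing import Dict, Iterable, List, Set


def stop_word_filter(candidates: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop every candidate phrase that contains at least one stop word."""
    return [c for c in candidates if not any(w in stop_words for w in c.split())]


def commonality_filter(doc_freq: Dict[str, float],
                       background_freq: Dict[str, float],
                       min_ratio: float = 2.0) -> Dict[str, float]:
    """Keep short concepts only if they are notably more frequent here than in the background corpus.

    Assumption: 'notably more frequent' means doc/background ratio >= min_ratio.
    Concepts longer than two words pass through unchanged.
    """
    kept = {}
    for phrase, rel in doc_freq.items():
        if len(phrase.split()) > 2:
            kept[phrase] = rel
            continue
        bg = background_freq.get(phrase, 0.0)
        if bg == 0.0 or rel / bg >= min_ratio:
            kept[phrase] = rel
    return kept


print(stop_word_filter(["the car", "jaguar car"], {"the", "a", "of"}))
print(commonality_filter({"car": 0.01, "jaguar": 0.01}, {"car": 0.01, "jaguar": 0.001}))
```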

23 Shixian Chu’s Approach

24 Parent-Network
[Figure: Parent-Network. Nodes are listed below with the number pairs shown beside them on the slide; edges in the figure are labeled L or R.]
  Root node
  One word: used (0,0); new (3,0); Jaguar (0,1) (1,0) (2,0) (3,1); car (0,2) (1,1) (2,1) (3,2); sale (0,3) (1,2); model (2,2)
  Two words: Used Jaguar (0,0); New Jaguar (3,0); Jaguar car (0,1) (1,0) (2,0) (3,1); Car sale (0,2) (1,1); Car model (2,1)
  Three words: Used Jaguar car (0,0); New Jaguar car (3,0); Jaguar car sale (0,1) (1,1); Jaguar car model (2,0)
  Four words: Used Jaguar car sale (0,0)

25 Simplified Parent-Network
[Figure: simplified Parent-Network. Nodes are listed with the number pairs shown on the slide.]
  Root node
  Jaguar car (0,1) (1,0) (2,0)
  Jaguar car sale (0,1) (1,1)
  Jaguar car model (2,0)
  Used Jaguar car sale (0,0)

26 Parent-Network-based Key Phrase Extraction
Step 1: Document pruning.
  – Sentence boundaries are marked and non-word tokens are stripped.
Step 2: Document stemming.
Step 3: Creating the Parent-Network.
Step 4: Computing logical frequency.
  – Logical frequency = physical frequency minus the logical frequencies of all its ancestors that have been accepted as key phrases.
  – If a phrase has no parents, its logical frequency equals its physical frequency.
  – A phrase is a key phrase if its logical frequency >= the frequency threshold of its level.
  – Computation proceeds from higher level to lower level (parent to child).
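Step 4 can be sketched as follows. Two assumptions are made here that the slides do not state: an "ancestor" is taken to be any longer phrase that contains the phrase as a contiguous run of words, and "higher level" is taken to mean longer phrases. The function names, the per-length thresholds, and the physical frequencies are illustrative, loosely modeled on the Jaguar example.

```python
from typing import Dict, Tuple

Phrase = Tuple[str, ...]


def contains(longer: Phrase, shorter: Phrase) -> bool:
    """True if 'shorter' occurs as a contiguous run of words inside 'longer'."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def extract_key_phrases(physical: Dict[Phrase, int],
                        thresholds: Dict[int, int]) -> Dict[Phrase, int]:
    """Return accepted key phrases mapped to their logical frequency."""
    accepted: Dict[Phrase, int] = {}
    # Process parents (longer phrases) before children (shorter phrases).
    for phrase in sorted(physical, key=len, reverse=True):
        inherited = sum(lf for anc, lf in accepted.items()
                        if len(anc) > len(phrase) and contains(anc, phrase))
        logical = physical[phrase] - inherited
        if logical >= thresholds.get(len(phrase), 1):
            accepted[phrase] = logical
    return accepted


# Illustrative physical frequencies and per-length thresholds.
physical = {
    ("used", "jaguar", "car", "sale"): 1,
    ("jaguar", "car", "sale"): 2,
    ("jaguar", "car", "model"): 1,
    ("jaguar", "car"): 4,
}
key_phrases = extract_key_phrases(physical, thresholds={2: 2, 3: 2, 4: 2})
print(key_phrases)
```

With these numbers, "jaguar car sale" is accepted with logical frequency 2, and "jaguar car" then keeps 4 - 2 = 2, which still meets its threshold.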

27 Phrase Extraction -- catch
  – Designed to work on documents and/or collections of documents
  – Tweets are very small

28 Logical Frequency
  – Arithmetic Logical Frequency
  – Entropy-based Logical Frequency

29 Solution
Create “tweet” collections
  – Randomly select X hashtags
  – For each hashtag, group tweets by time: hour, day, or week
  – Each hashtag/time group is now a collection
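The grouping above can be sketched as below. Tweets are represented as `(timestamp, hashtags, text)` tuples, which is an assumed layout rather than the project's schema; `time_bucket`, `build_collections`, and the `seed` parameter are hypothetical names.

```python
import random
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

TweetRow = Tuple[datetime, List[str], str]  # assumed layout: (timestamp, hashtags, text)


def time_bucket(ts: datetime, period: str) -> str:
    """Label the hour, day, or ISO week that a timestamp falls into."""
    if period == "hour":
        return ts.strftime("%Y-%m-%d %H")
    if period == "day":
        return ts.strftime("%Y-%m-%d")
    if period == "week":
        year, week, _ = ts.isocalendar()
        return f"{year}-W{week:02d}"
    raise ValueError(f"unknown period: {period}")


def build_collections(tweets: List[TweetRow], x: int, period: str = "day",
                      seed: int = 0) -> Dict[Tuple[str, str], List[str]]:
    """Randomly pick x hashtags, then group tweet texts by (hashtag, time bucket)."""
    all_tags = sorted({tag for _, tags, _ in tweets for tag in tags})
    chosen = set(random.Random(seed).sample(all_tags, min(x, len(all_tags))))
    collections: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for ts, tags, text in tweets:
        for tag in tags:
            if tag in chosen:
                collections[(tag, time_bucket(ts, period))].append(text)
    return dict(collections)


# Invented sample tweets.
tweets = [
    (datetime(2012, 8, 28, 10, 5), ["isaac"], "a"),
    (datetime(2012, 8, 28, 22, 0), ["isaac", "nola"], "b"),
    (datetime(2012, 8, 29, 1, 0), ["isaac"], "c"),
]
print(build_collections(tweets, x=2, period="day"))
```

Each resulting list of texts can then be treated as one "document collection" for a phrase extraction method.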

30 Evaluation
Test the impact of changing
  – Number of hashtags
  – Time period used to group
  – Threshold values
Questions to answer
  – What is the impact on the number of keywords?
  – How much overlap is there?
  – Do the results look reasonable?
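One way to put a number on "how much overlap" between the keyword sets produced by two settings is the Jaccard index. The slides do not name a measure, so this choice is an assumption, and the keyword sets below are invented.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Invented keyword sets from two hypothetical runs.
keywords_daily = {"storm surge", "power outage", "levee"}
keywords_hourly = {"power outage", "levee", "curfew", "shelter"}
print(len(keywords_daily), len(keywords_hourly), jaccard(keywords_daily, keywords_hourly))
```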

31 Resources
  – Twitter collection code
    – Need to check availability
    – If it is not available, it is fairly straightforward to implement.
  – Database schema
    – MySQL

32 Thank you. Questions?

