Topic Clustering of Streaming Tweets Based on a GPU-Accelerated Self-Organizing Map
Group 15: Chen Zhutian, Huang Hengguang
Unsupervised clustering algorithm. Organizes large document collections according to textual similarity. Creates a visual result for searching and exploring large document collections.
WEBSOM system: Based on the Self-Organizing Map. Generates a topic map for documents. Explore large document collections the way you explore Google Maps.
What does WEBSOM look like?
Gap
WEBSOM: long documents, static, long training time.
Twitter: short text, dynamic, streaming data.
How to adapt SOM to streaming Twitter data?
What our system looks like
Pipeline: Detect Event -> Build Dictionary -> Vectorize Tweets -> Reduce Dimension -> SOM Cluster -> Show the SOM map (current stage: Detect Event)
Only focus on unusual events. How to identify abnormal events on Twitter?
Detect Event
1. Similar to TCP's congestion control mechanism.
2. Count the number of tweets in a moving window.
3. Maintain a weighted moving average and variance.
4. Apply a threshold to decide whether the window is an event.
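The four steps above can be sketched as follows. This is a minimal illustration, not the system's actual code. It assumes the TCP analogy refers to the smoothed mean and deviation estimator TCP uses for its retransmission timer, which the slide does not spell out. The function name `detect_events` and the parameters `alpha`, `beta`, `k`, and `warmup` are hypothetical.

```python
def detect_events(counts, alpha=0.125, beta=0.25, k=4.0, warmup=3):
    """Return indices of windows whose tweet count is abnormally high.

    counts: tweets per window, in time order.
    alpha, beta: smoothing gains for the mean and the deviation
                 (assumed values, borrowed from TCP's timer estimator).
    k: how many deviations above the mean counts as an event (assumed).
    warmup: number of initial windows used only to build a baseline.
    """
    events = []
    mean = None
    dev = 0.0
    for i, c in enumerate(counts):
        if mean is None:
            # First window seeds the baseline, as TCP does with its first sample.
            mean = float(c)
            dev = c / 2.0
            continue
        if i >= warmup and c > mean + k * dev:
            # Abnormal burst: report it and leave the baseline untouched,
            # so the burst does not inflate the running average.
            events.append(i)
        else:
            dev = (1 - beta) * dev + beta * abs(c - mean)
            mean = (1 - alpha) * mean + alpha * c
    return events


counts = [100, 102, 98, 101, 99, 100, 400, 101, 100]
print(detect_events(counts))
```

Skipping the baseline update during a burst is a design choice of this sketch; updating it anyway would make the detector adapt faster but raise the threshold after each peak.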
Test Data
Detect Event
Time of peak   What happened?
4:11           First goal!
4:25           Goal! x3 in 3 minutes
4:30           Goal!
5:07           Second half begins
5:25           Goal!
5:35           Goal!
5:46           Goal!
5:50           End!
Pipeline: Detect Event -> Build Dictionary -> Vectorize Tweets -> Reduce Dimension -> SOM Cluster -> Show the SOM map (current stage: Build Dictionary)
Build Dictionary
1. Remove stop words.
2. Stemming with the Snowball stemmer.
3. Remove words whose occurrence is less than 10%.
4. Remove words whose occurrence is greater than 50%.
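A rough sketch of these dictionary-building steps is given below. It assumes "occurrence" means the fraction of tweets containing the word (document frequency), which the slide does not state. The stop-word list is a tiny placeholder, and `stem` defaults to the identity function; a real Snowball stemmer would be passed in instead. All names here (`build_dictionary`, `tokenize`, `min_df`, `max_df`) are hypothetical.

```python
import re
from collections import Counter

# Tiny placeholder list; a real system would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "in", "at", "of", "and", "to", "rt"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(tweets, stem=lambda w: w, min_df=0.10, max_df=0.50):
    """Map each kept term to a column index.

    A term is kept when the fraction of tweets containing it lies in
    [min_df, max_df]. stem is a word -> stem callable (identity here;
    assumed to be a Snowball stemmer in practice).
    """
    doc_freq = Counter()
    for tweet in tweets:
        # Use a set so each tweet counts a term at most once.
        terms = {stem(w) for w in tokenize(tweet) if w not in STOP_WORDS}
        doc_freq.update(terms)
    n = len(tweets)
    kept = sorted(t for t, df in doc_freq.items() if min_df <= df / n <= max_df)
    return {term: idx for idx, term in enumerate(kept)}


tweets = ["goal brazil"] * 4 + ["goal germany"] * 8 + ["the match"] * 7 + ["typo"]
print(build_dictionary(tweets))
```

In this toy input, "goal" is dropped as too frequent, "typo" as too rare, and "the" as a stop word.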
Vectorize Tweets
1. Vector space model
2. TF-IDF
3. Normalization
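One way to realize these three steps is sketched below. The slide does not say which TF-IDF variant or which norm is used, so this assumes raw term counts, `idf = log(N / df)`, and L2 normalization. The function name `tfidf_vectors` and its argument layout are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, dictionary):
    """Turn tokenized tweets into L2-normalized TF-IDF vectors.

    docs: list of token lists.
    dictionary: term -> column index; terms outside it are ignored.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update({t for t in tokens if t in dictionary})
    # Assumed IDF variant: log(N / df), no smoothing.
    idf = {t: math.log(n / df[t]) for t in df}

    vectors = []
    for tokens in docs:
        vec = [0.0] * len(dictionary)
        for term, count in Counter(tokens).items():
            if term in dictionary:
                vec[dictionary[term]] = count * idf[term]
        norm = math.sqrt(sum(x * x for x in vec))
        # Leave all-zero vectors as they are to avoid dividing by zero.
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vectors


docs = [["goal", "brazil"], ["goal", "germany"], ["germany", "germany", "wins"]]
dictionary = {"brazil": 0, "germany": 1, "goal": 2}
for v in tfidf_vectors(docs, dictionary):
    print([round(x, 3) for x in v])
```

Normalizing to unit length keeps short and long tweets comparable when distances are computed later in the pipeline.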
Pipeline: Detect Event -> Build Dictionary -> Vectorize Tweets -> Reduce Dimension -> SOM Cluster -> Show the SOM map (current stage: Reduce Dimension)
Reduce Dimension: Random Projection
1. No training.
2. Matrix operation.
Based on the Johnson-Lindenstrauss lemma.
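The idea can be illustrated with a small sketch. It assumes a dense Gaussian projection matrix scaled by 1/sqrt(k), which is one common construction satisfying the Johnson-Lindenstrauss lemma; the slides do not say which random matrix the system uses. The names `make_projection` and `project` are hypothetical, and the code runs on the CPU in plain Python for clarity.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix with N(0, 1/out_dim) entries.

    No training is involved: the matrix is drawn once and reused.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vec):
    """Reduce vec to out_dim dimensions with one matrix-vector product."""
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]


# Squared length is preserved in expectation after projection.
R = make_projection(in_dim=500, out_dim=200, seed=1)
x = [1.0] * 500
y = project(R, x)
ratio = sum(v * v for v in y) / sum(v * v for v in x)
print(len(y), round(ratio, 2))
```

The printed ratio should be close to 1 but not exactly 1, which is the distortion the lemma bounds.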
Pipeline: Detect Event -> Build Dictionary -> Vectorize Tweets -> Reduce Dimension -> SOM Cluster -> Show the SOM map (current stage: SOM Cluster)
SOM Cluster
What is SOM? Self-Organizing Map.
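For readers new to the technique, here is a minimal sketch of the classic online SOM training loop: find the best matching unit (BMU) for each input, then pull the BMU and its grid neighbors toward that input. It is a plain CPU illustration, not the GPU-accelerated version the title refers to. The linear decay schedules, the Gaussian neighborhood, and the names `train_som` and `best_matching_unit` are assumptions of this sketch.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(weights, key=lambda node: sum(
        (wi - xi) ** 2 for wi, xi in zip(weights[node], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total
            lr = lr0 * (1.0 - frac)                 # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighborhood shrinks
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


data = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
        [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
som = train_som(data, rows=3, cols=3)
print(best_matching_unit(som, [0.05, 0.05]), best_matching_unit(som, [0.95, 0.95]))
```

Inputs from the two groups should land on different map cells, which is what makes the map usable as a clustering and visualization tool.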
Test Data
20 Newsgroup Test
Method         Random Projection   Macro Accuracy (%)   Micro Accuracy (%)
Renato's SOM   No                  68                   67
Our Method     Yes                 60                   61
Conclusion: Random projection loses precision, so performance decreases after dimension reduction.
20 Newsgroup Test
Method                               Random Projection   Macro Accuracy (%)   Micro Accuracy (%)
Renato's SOM                         No                  68                   67
Our Method                           Yes                 60                   61
Matlab replication of Renato's SOM   No                  63                   62
Matlab replication of Renato's SOM   Yes
FIFA Data
Conclusion
Thanks for Watching Q & A