Topic Clustering of Streaming Tweets Based on a GPU-Accelerated Self-Organizing Map. Group 15: Chen Zhutian, Huang Hengguang.


Unsupervised clustering algorithm. Organizes large document collections according to textual similarity. Creates a visual result for searching and exploring large document collections.

WEBSOM system. Based on the Self-Organizing Map. Generates a topic map for documents. Lets you explore a large document collection the way you explore Google Maps.

What does WEBSOM look like?

Gap. WEBSOM: long documents, static data, long training time. Twitter: short text, dynamic, streaming data. How can SOM be adapted to streaming Twitter data?

What our system looks like

Pipeline: Detect Event → Build Dictionary → Vectorize Tweets → Reduce Dimension → SOM Cluster → Show the SOM map. Current step: Detect Event.

Only focus on unusual events. How to identify abnormal events on Twitter?

Detect Event
1. Similar to TCP’s congestion control mechanism.
2. Count the number of tweets in a moving window.
3. Maintain a weighted moving average and variance of the counts.
4. Apply a threshold to decide whether a window is an event.
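The four steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the TCP analogy refers to the exponentially weighted mean and deviation that TCP keeps for round-trip times, and it borrows that estimator's usual constants (`alpha = 1/8`, `beta = 1/4`, `k = 4`). The function name `detect_events`, the `warmup` parameter, and the use of mean absolute deviation in place of variance are all assumptions.

```python
def detect_events(counts, alpha=0.125, beta=0.25, k=4.0, warmup=3):
    """Return indices of windows whose tweet count exceeds mean + k * deviation.

    counts: tweets per moving window, in time order.
    alpha, beta, k: smoothing gains and threshold multiplier (assumed values).
    warmup: number of initial windows used only to seed the statistics.
    """
    events = []
    mean = float(counts[0])  # smoothed tweet count
    dev = 0.0                # smoothed absolute deviation from the mean
    for i, c in enumerate(counts[1:], start=1):
        # Compare against the statistics from *before* this window.
        if i >= warmup and c > mean + k * dev:
            events.append(i)
        # Update deviation first (it uses the old mean), then the mean.
        dev = (1 - beta) * dev + beta * abs(c - mean)
        mean = (1 - alpha) * mean + alpha * c
    return events


# Hypothetical per-minute tweet counts with one burst at index 6.
counts = [100, 102, 98, 101, 99, 100, 400, 105, 100]
print(detect_events(counts))
```

Because the burst itself is folded into the statistics, the threshold rises sharply after an event, which suppresses repeated alarms while the same burst decays.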

Test Data

Detect Event

Time of Peak   What happened?
4:11           First goal!
4:25           Goal! x3 in 3 minutes
4:30           Goal!
5:07           Second half begins
5:25           Goal!
5:35           Goal!
5:46           Goal!
5:50           End!

Pipeline, current step: Build Dictionary


Build Dictionary
1. Remove stop words.
2. Stemming: Snowball stemmer.
3. Remove words whose occurrence is less than 10%.
4. Remove words whose occurrence is greater than 50%.
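A rough sketch of the dictionary-building steps above, under stated assumptions: "occurrence" is read here as document frequency (the fraction of tweets containing the word), the stop-word list is a tiny placeholder, and the `stem` argument defaults to the identity function so the example runs without external libraries. A Snowball stemmer (for example NLTK's `SnowballStemmer("english").stem`) could be passed in its place. The name `build_dictionary` is illustrative only.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to", "rt"}


def build_dictionary(tweets, stem=lambda w: w, min_df=0.10, max_df=0.50):
    """Map each kept term to a column index.

    A term is kept if the fraction of tweets containing it lies in
    [min_df, max_df]. Both thresholds follow the slide; treating them
    as document-frequency ratios is an assumption.
    """
    doc_freq = {}
    for text in tweets:
        words = re.findall(r"[a-z]+", text.lower())
        # Use a set so each tweet counts a term at most once.
        terms = {stem(w) for w in words if w not in STOP_WORDS}
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    n = len(tweets)
    kept = sorted(t for t, df in doc_freq.items() if min_df <= df / n <= max_df)
    return {term: index for index, term in enumerate(kept)}


tweets = ["Goal for Brazil", "Goal for Germany", "Goal again", "What a match"]
print(build_dictionary(tweets))
```

In this toy input "goal" appears in 75% of the tweets and is dropped as too common, while "for" sits exactly at the 50% bound and is kept.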

Vectorize Tweets
1. Vector space model
2. TF-IDF
3. Normalization
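The three vectorization steps can be illustrated with a short standard-library sketch. It assumes raw term counts for TF, `log(N / df)` for IDF, and L2 normalization; the slides do not say which TF-IDF variant or norm was used, so these choices and the name `tfidf_vectors` are assumptions.

```python
import math


def tfidf_vectors(token_lists, vocab):
    """Turn tokenized tweets into L2-normalized TF-IDF vectors.

    token_lists: one list of (already stemmed) tokens per tweet.
    vocab: dict mapping term -> column index.
    """
    n = len(token_lists)
    # Document frequency of every vocabulary term.
    df = [0] * len(vocab)
    for tokens in token_lists:
        for term in set(tokens):
            if term in vocab:
                df[vocab[term]] += 1

    vectors = []
    for tokens in token_lists:
        vec = [0.0] * len(vocab)
        for term in tokens:              # vector space model: term counts
            if term in vocab:
                vec[vocab[term]] += 1.0
        for j, tf in enumerate(vec):     # TF-IDF weighting
            if tf > 0.0:
                vec[j] = tf * math.log(n / df[j])
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0.0:                   # normalization to unit length
            vec = [x / norm for x in vec]
        vectors.append(vec)
    return vectors


vocab = {"goal": 0, "brazil": 1, "germany": 2}
docs = [["goal", "brazil"], ["goal", "germany"], ["brazil", "brazil"]]
for v in tfidf_vectors(docs, vocab):
    print([round(x, 3) for x in v])
```

Normalizing to unit length keeps short and long tweets comparable, so Euclidean distance between vectors depends on word choice rather than tweet length.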

Pipeline, current step: Reduce Dimension

Reduce Dimension: Random Projection
1. No training.
2. Matrix operation.
Based on the Johnson-Lindenstrauss lemma.
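To make the "no training, matrix operation" point concrete, here is a minimal sketch of a Gaussian random projection. The Johnson-Lindenstrauss lemma says that such a projection approximately preserves pairwise distances when the target dimension `k` is large enough relative to the number of points. The Gaussian matrix with `1/sqrt(k)` scaling is one common construction and is an assumption here; the slides do not say which random matrix was used. The function names are illustrative.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """Build a d x k matrix with i.i.d. N(0, 1/k) entries (no training needed)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project(vec, matrix):
    """Multiply a length-d row vector by the d x k projection matrix."""
    k = len(matrix[0])
    out = [0.0] * k
    for x, row in zip(vec, matrix):
        if x != 0.0:  # TF-IDF vectors are sparse, so skip zero entries
            for j in range(k):
                out[j] += x * row[j]
    return out


# Project a hypothetical 1000-dimensional sparse vector down to 50 dimensions.
R = random_projection_matrix(d=1000, k=50)
v = [0.0] * 1000
v[3], v[417] = 0.6, 0.8
print(len(project(v, R)))
```

The same matrix `R` has to be reused for every tweet, otherwise distances between projected vectors are meaningless.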

Pipeline, current step: SOM Cluster

SOM Cluster. What is SOM? Self-Organizing Map.
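For readers new to SOMs, the following is a minimal CPU sketch of the standard online training rule: find the best-matching unit (BMU) for each input, then pull the BMU and its grid neighbors toward that input. It is not the GPU implementation from the talk. The linear decay schedules, the Gaussian neighborhood, and all names (`train_som`, `best_matching_unit`) are assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(weights, key=lambda u: sum((w - xi) ** 2 for w, xi in zip(weights[u], x)))


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM and return {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            frac = step / total
            lr = lr0 * (1.0 - frac)                # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 1e-3   # neighborhood radius shrinks
            br, bc = best_matching_unit(weights, x)
            # Each node's update is independent of the others, which is the
            # part a GPU version could run in parallel.
            for (r, c), w in weights.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for j in range(dim):
                    w[j] += lr * h * (x[j] - w[j])
            step += 1
    return weights


# Hypothetical 2-D inputs; real inputs would be the projected tweet vectors.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=2, cols=2)
print(best_matching_unit(som, [0.05, 0.0]))
```

After training, each tweet is assigned to the cell of its BMU, and nearby cells on the grid hold similar tweets, which is what makes the map browsable.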

Test Data

20 Newsgroup Test

Method         Random Projection   Macro Accuracy (%)   Micro Accuracy (%)
Renato’s SOM   NO                  68                   67
Our Method     YES                 60                   61

Conclusion: Random projection loses precision, so accuracy decreases after dimension reduction.

20 Newsgroup Test

Method                        Random Projection   Macro Accuracy (%)   Micro Accuracy (%)
Renato’s SOM                  NO                  68                   67
Our Method                    YES                 60                   61
Matlab repeat Renato’s SOM    NO                  63                   62
Matlab repeat Renato’s SOM    YES

FIFA Data

Conclusion

Thanks for Watching Q & A