Clustering tweets and webpages

Slides:

Advertisements

Similar presentations

Chapter 5: Introduction to Information Retrieval

Advertisements

TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.

Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

Unsupervised learning

Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.

Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.

Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.

Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.

Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.

IR Models: Review Vector Model and Probabilistic.

Clustering Unsupervised learning Generating “classes”

Web Information Retrieval Projects Ida Mele. Rules Students can work in teams (max 3 people) The project must be delivered by the deadline that will be.

Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.

CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.

Advanced Multimedia Text Classification Tamara Berg.

Modeling (Chap. 2) Modern Information Retrieval Spring 2000.

The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.

Tag-based Social Interest Discovery

Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.

Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.

Medical Data Classifier undergraduate project By: Avikam Agur and Maayan Zehavi Advisors: Prof. Michael Elhadad and Mr. Tal Baumel.

Chapter 6: Information Retrieval and Web Search

1 Computing Relevance, Similarity: The Vector Space Model.

Solr Team CS5604: Cloudera Search in IDEAL Nikhil Komawar, Ananya Choudhury, Rich Gruss Tuesday May 5, 2015 Department of Computer Science Virginia Tech,

Distributed Information Retrieval Server Ranking for Distributed Text Retrieval Systems on the Internet B. Yuwono and D. Lee Siemens TREC-4 Report: Further.

CPSC 404 Laks V.S. Lakshmanan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley.

LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.

VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.

ProjFocusedCrawler CS5604 Information Storage and Retrieval, Fall 2012 Virginia Tech December 4, 2012 Mohamed M. G. Farag Mohammed Saquib Khan Prasad Krishnamurthi.

E VENT D ETECTION USING A C LUSTERING A LGORITHM Kleisarchaki Sofia, University of Crete, 1.

Language Modeling Putting a curve to the bag of words Courtesy of Chris Jordan.

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

CS791 - Technologies of Google Spring A Webbased Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.

Information Storage and Retrieval(CS 5604) Collaborative Filtering 4/28/2016 Tianyi Li, Pranav Nakate, Ziqian Song Department of Computer Science Blacksburg,

Big Data Processing of School Shooting Archives

Information Retrieval in Practice

Classification CS5604 Information Storage and Retrieval - Fall 2016

A Simple Approach for Author Profiling in MapReduce

Sentiment and Topic Analysis

CS6604 Digital Libraries Global Events Team Final Presentation

Collection Management (Tweets) Final Presentation

Tutorial: Big Data Algorithms and Applications Under Hadoop

Semantic Processing with Context Analysis

Collection Management Webpages

Presenters: Radha Krishnan, Sneha Mehta April 28, 2016

IDEALvr Team: Luciano Biondi, Omavi Walker, Dagmawi Yeshiwas

Collection Management

Common Crawl Mining Team: Brian Clarke, Tommy Dean, Ali Pasha, Casey Butenhoff Manager: Don Sanderson (Eastman Chemical Company) Client: Ken Denmark.

Extraction, aggregation and classification at Web Scale

CLA Team Final Presentation CS 5604 Information Storage and Retrieval

Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:

VR4GETAR CS4624: Multimedia, Hypertext and Information Access

Trail Study Kevin Cianfarini, Shane Davies, Marshall Hansen, Andrew Eason … CS4624: Multimedia, Hypertext, and Information Access Instructor: Dr. Edward.

Clustering and Topic Analysis

CS 5604 Information Storage and Retrieval

CMPT 733, SPRING 2016 Jiannan Wang

Team FE Final Presentation

Collection Management Webpages Final Presentation

Representation of documents and queries

Event Trend Detector Ryan Ward, Skylar Edwards, Jun Lee, Stuart Beard, Spencer Su CS 4624 Multimedia, Hypertext, and Information Access Instructor: Edward.

Twitter Equity Firm Value

Information Storage and Retrieval

News Event Detection Website Joe Acanfora, Briana Crabb, Jeff Morris

Computational Linguistic Analysis of Earthquake Collections

Resource Recommendation for AAN

Authors: Wai Lam and Kon Fan Low Announcer: Kyu-Baek Hwang

CMPT 733, SPRING 2017 Jiannan Wang

Presented by Nick Janus

Austin Karingada, Jacob Handy, Adviser : Dr

Presentation transcript:

Clustering tweets and webpages Saket vishwasrao Swapna thorve Social network Lijie tang May 3, 2016 Spring 2016 CS 5604 Information storage and retrieval Instructor : Dr. Edward Fox Virginia Polytechnic Institute and State University, Blacksburg, VA 24061

clustering

Clustering overview Feature Extraction K-means WSSE evaluation TF-IDF word2vec K-means WSSE evaluation HBase Schema Clustering and LDA result evaluation Future Work

Feature extraction TF-IDF Word2vec Represent documents as a bag of words model Vector dimension = Vocabulary size Word score = TF-IDF Word2vec Combine all tweets to a single document Train a neural network and extract vector representation of each word Document vector = Sum all vectors (for each word) in a document

Clustering with k-means Normalize data using ||L2|| norm Write to HDFS as part files. Compute Within Set Sum of Squares (WSSE) scores Tweet distribution

Comparison of TF-IDF vs. word2vec

WSSE Based evaluation

Running over different TwEET collections Clusters 541 #NAACPBombing 602 #Germanwings 668 #houstonflood 686 #Obamacare 694 #4thofjuly 1374 31189 283 5333 6627 1 7504 26297 899 36070 5529 2 18375 5785 2605 16899 762 3 3166 11979 3672 69555 13663 4 1009 8328 1488 36383 1427 5 2437 6459 5884 19302 3961 Total sizes Put pie charts Collection names Right align the data

Running over different TwEET collections

Hbase schema design Document ID Cluster no. Cluster label Probability String (Tweet-id/URL) Integer Float

Collections evaluated Tweets - 602, 668, 694, 541 Webpages - 686 (TODO)

Clustering evaluation using topic analysis data 1. Cluster labels 2. Hierarchy of mixed topics and clusters 3. Probability of a document belonging to the cluster Extract relevant topics per cluster using mean, frequency, deviation Create topic frequency matrix per cluster Combine document matrices Picture of matrix circle 2, sketches n stuff Outcome

Explanation + How many times a topic occurs in a cluster ? Extract relevant topics per cluster using mean, frequency, deviation Create topic frequency matrix per cluster Combine document matrices Doc id Cluster no Doc id Topics Probability + How many times a topic occurs in a cluster ? Cluster no Topics per cluster 1 T3, T2 2 T1, T2 Cluster no Topic frequency 1 T1=2, T5=2, T2=1, T3=1 2 T1=3, T3=3, T5=1, T2=5

results Collection 602 - tweets Cluster no (wordToVec) Topics Topic words Cluster labels Topic 3, Topic 4 crash,germanwings,pilot,victims,plane,Lufthansa germanwings,lubitz,copilot,flight,playing Germanwings crash and pilot Lubitz 1 Topic 1, Topic 5 germanwings,ripley,believe,click,bigareveal germanwings,world,charliehebdo,lives,garissa Germanwings news 2 Topic 4, Topic 5 Germanwings crash news 3 Topic 1, Topic 4 Pilot Lubitz of Germanwings 4 Topic 1, Topic 2 germanwings,copiloto,vctimas,accidente,vuelo Happenings and accidents 5

Hierarchy of mixed topics and clusters Collection 602 - tweets Pilot Lubitz of Germanwings Cluster 5 Topic 2 Topic 3 Cluster 4 Cluster 3 Cluster 0 Topic 1 Topic 4 Change the figure Happenings and accidents Pilot Lubitz of Germanwings Germanwings crash and pilot Lubitz Cluster 1 Cluster 2 Topic 5 Germanwings news Germanwings crash news

Hierarchy of mixed topics and clusters Collection 602 - tweets Cluster 0 Topic 1 Topic 5 Topic 2 Cluster 0 Cluster 2 Cluster 4 Change the figure Topic 3 Cluster 3, Cluster 5 Topic 4

Future work Use probabilistic models for clustering Clustering evaluation - Evaluate clusters with internal and external criteria - Implement Silhouette scoring in Spark-Scala - Establish ground truth for comparison of evaluation results (probabilities) Labeling of clusters - Label extraction from clusters words - Compare cluster labeling methods Clustering as a start point - Feed clustering results to LDA/classification/collaborative filtering for more accurate results. Reverse the approach Labelling clusters

Social network

objective To find the most relevant content given an User Query Clusters/Classification/Topic Modeling are content driven Social Network are User driven Query is user generated Use figures from your work rather than this one

assumptions 1. More RT = important tweets 2. More RT = important accounts 3. More follower = important accounts 4. Generated/Distributed by important accounts = important tweets 5. Higher RT ratio inside cluster = important tweets/accounts in the cluster 6. More follower inside cluster = important accounts in cluster Clusters could be content cluster or SN cluster

Importance factors (general) 1. Start from follower counts: IFaccount = Number of followers/Maximum follower count in SN 2. Tweet IF: IFtweet = SUM ( RT * IFaccount) for each RT 3. Account IF: IFaccount = SUM (IFtweet )/Tweet count Repeat 2, 3 until converge

Importance factors (cluster) ? Should we consider outside influence ? Two tweet with same RT network inside a cluster Are they the same important? In our approach Inside IF is calculated as the general approach Inside IF is more important than general IF Inside IF first General IF second Simplify and give example: thought experiment or really done. If really done give examples else drop this slide

Social Network Drawling. Clustering Based on Social Network Structure: Work Flow Data Collection Social Network Building Social Network Clustering Important Factor Calculation Twitter Crawling: build a crawler to scrapy the twitter account information. Twitter connections summarization: find the RT and Mention (@) between accounts. Calculation of the nodes-edge networks: nodes – accounts; edge – follow, RT, mention. Social Network Drawling. Clustering Based on Social Network Structure: Graph Clustering Given Nodes and Edges. Nodes Grouping Using Clustering result. Importance factor calculation based position in the social network (centroid or border of a cluster) Twitter Crawler: 1. User Information; 2. RT counts of each tweets. Program the RT, Mention Network Builder. Social Network Visualization. Clustering algorithm and programming. Work Key Activities Milestones

Tool used Crawl: Python -> tweepy SN build: Python Streaming tweets in real time SN build: Python Simple Visualization: Gephi. Could expand to D3. Make this slide clearer. Nice to know what is the dataset. Histogram of importance factors/scores. Pick up a few points from the graphs and explain them or show their scores. Supporting data for the graph shown here. Make a workflow diagram, because the slide is not clear.

acknowledgments Dr. Fox NSF for grant IIS – 1319578 IDEAL project GRAs – Sunshin Lee and Mohamed Magdy Farag Teams – Collection management, Solr, Front end, Topic Analysis Class – Discussions, suggestions

Questions? Thank you