Published by Beverley Ferguson. Modified over 9 years ago.
1
Using Digg to Save Lives
Or, Using Digg to Find Interesting Events
Presented by: Luis Zaman, Amir Khakpour, and John Felix
2
Outline
- Introduction
- Motivation and Approach
- Preprocessing
- Clustering
- Document Clustering
- Results
- Visualization Tools
- Demo!
3
Explanation
- Digg is a social web-media discovery tool based on user-submitted content.
  - 1 or 2 submissions a minute
  - Half-life of “interest” is about a day
- Digg aggregates “interesting” content.
- But how do we find interesting events and know their themes?
4
Motivation
- The collaborative nature of social media can scour the WWW very thoroughly.
- But this generates A LOT of data (you’ll see).
- It would be cool to find emergencies or critical situations based on this collaborative media.
- Digg's Apple topic seems like a pretty good starting point.
5
Approach
- Get Digg time series data for 3 months
- Cluster Digg stories
- Visualize the time series
- Show hot “topics” for a clicked point in the graph
6
Preprocessing
- Digg API
  - REST API: http://services.digg.com/stories/topic/apple?count=10
  - XML response
- Limitations
  - 100 results per request
  - 1 hour of time series data
  - Can’t go fast, or else.
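The paging and throttling implied by these limitations can be sketched as follows. This is a minimal illustration, not the presenters' crawler: the `offset` parameter, the XML element and attribute names, and the one-second delay are all assumptions made for the example, and only the base URL, the `count` parameter, and the 100-result cap come from the slide.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://services.digg.com/stories/topic/apple"  # URL from the slide
MAX_COUNT = 100  # per-request cap stated on the slide


def page_urls(total, page_size=MAX_COUNT):
    """Yield request URLs covering `total` stories, at most 100 per request."""
    for offset in range(0, total, page_size):
        count = min(page_size, total - offset)
        # `offset` is an assumed paging parameter; the slide only shows `count`.
        yield f"{BASE}?{urlencode({'count': count, 'offset': offset})}"


# Hypothetical response shape, for illustration only.
SAMPLE_XML = """<stories>
  <story id="1" diggs="42"><title>First</title></story>
  <story id="2" diggs="7"><title>Second</title></story>
</stories>"""


def parse_stories(xml_text):
    """Extract id, digg count, and title from each <story> element."""
    root = ET.fromstring(xml_text)
    return [
        {"id": s.get("id"), "diggs": int(s.get("diggs")), "title": s.findtext("title")}
        for s in root.iter("story")
    ]


def crawl(total, fetch, delay=1.0, sleep=time.sleep):
    """Fetch every page through `fetch(url) -> xml text`, pausing between requests."""
    stories = []
    for url in page_urls(total):
        stories.extend(parse_stories(fetch(url)))
        sleep(delay)  # throttle: the API "can't go fast, or else"
    return stories
```

Passing `fetch` in as a callable keeps the sketch testable without network access; a real crawler would supply an HTTP client there.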
7
Preprocessing
- Time Series
  - Each digg is the event (only 100 at a time)
  - Rows: each story’s digg count
  - Columns: every hour (2,207 of them from August 08 – November 08)
- Clustering
  - Rows: each story that was dugg at any point in the time series
  - Columns: the words in the title and description of this story
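The time series layout described here (one row per story, one column per hour) could be built roughly like this. It is a sketch under assumptions: the input is taken to be a list of `(story_id, timestamp)` digg events, and the function names are made up for the example.

```python
from collections import defaultdict
from datetime import datetime


def hour_index(ts, start):
    """Column index: whole hours elapsed between `start` and `ts`."""
    return int((ts - start).total_seconds() // 3600)


def build_time_series(diggs, start, n_hours):
    """Turn (story_id, datetime) digg events into {story_id: [diggs per hour]}."""
    series = defaultdict(lambda: [0] * n_hours)
    for story_id, ts in diggs:
        col = hour_index(ts, start)
        if 0 <= col < n_hours:  # ignore events outside the observed window
            series[story_id][col] += 1
    return dict(series)
```

A usage example with an arbitrary start date and a three-hour window:

```python
start = datetime(2008, 8, 1)
events = [
    ("a", datetime(2008, 8, 1, 0, 30)),
    ("a", datetime(2008, 8, 1, 0, 59)),
    ("a", datetime(2008, 8, 1, 2, 0)),
    ("b", datetime(2008, 8, 1, 1, 15)),
]
table = build_time_series(events, start, n_hours=3)
```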
8
Preprocessing - Challenges
- SLOW
- Really dirty data
- Different formats of data
- REALLY SLOW
9
Introduction to Document Clustering
- Challenges of clustering text documents, unlike structured data, are:
  - Volume
  - Dimensionality
  - Sparsity
  - Complex semantics
- In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM).
- Text documents are converted to a matrix A_{m,n}: for m documents and a total of n words (or phrases), each element x_{i,j} represents the frequency of the j-th term in the i-th document.
- The result is a huge sparse matrix, so we just store the non-zero values.
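A small sketch of the Vector Space Model representation, storing only non-zero term frequencies. The tokenizer (lowercase alphanumeric runs) and the dict-per-row storage are assumptions chosen for brevity, not necessarily what the presenters used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_sparse_tf(docs):
    """Return (rows, vocab): rows[i] maps column j -> frequency x_ij, non-zeros only."""
    vocab = {}  # term -> column index j
    rows = []
    for text in docs:
        row = {}
        for term, freq in Counter(tokenize(text)).items():
            j = vocab.setdefault(term, len(vocab))  # assign a new column on first sight
            row[j] = freq
        rows.append(row)
    return rows, vocab


def density(rows, n_terms):
    """Fraction of the m x n matrix that is non-zero."""
    nonzero = sum(len(r) for r in rows)
    return nonzero / (len(rows) * n_terms)
```

For example, `build_sparse_tf(["Apple releases new iPod", "New Apple iPhone, new features"])` yields two rows over a six-term vocabulary, and the second row stores a frequency of 2 for "new".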
10
Clustering
- Dataset
  - Number of stories (m): 25470
  - Total number of unique words (n): 55557
  - Nonzero values: 469323 (0.03214%)
- Clustering using CLUTO software
  - Using K-means and bisecting K-means
  - Calculating centroids and SSE
  - A C++ program is run on “black”
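To make "calculating centroids and SSE" concrete, here is a minimal sketch that assumes dense vectors and squared Euclidean distance. The presenters' C++ program and CLUTO itself may use a different similarity measure (for example cosine), so treat this only as an illustration of the quantity.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in members)
    return total
```

Comparing two labelings of the same points shows why a lower SSE indicates tighter clusters:

```python
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = sse(points, [0, 0, 1, 1])  # groups nearby points together
bad = sse(points, [0, 1, 0, 1])   # pairs points that are far apart
```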
11
Document Clustering by Optimizing Criterion Functions
According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it:
- Internal criterion functions (I): maximizing the internal similarity function
- External criterion functions (E): minimizing the external similarity function
- Hybrid criterion functions (H): maximizing a combination of the two
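For reference, the criterion functions as they are usually stated in Zhao and Karypis's work are given below. Which variants the presenters actually showed is an assumption here. The notation is: k clusters S_1, ..., S_k with sizes n_r, documents d_i normalized to unit length, C_r the centroid of cluster S_r, and C the centroid of the whole collection.

```latex
\begin{align*}
\mathcal{I}_1 &: \max \sum_{r=1}^{k} \frac{1}{n_r} \sum_{d_i, d_j \in S_r} \cos(d_i, d_j) \\
\mathcal{I}_2 &: \max \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r) \\
\mathcal{E}_1 &: \min \sum_{r=1}^{k} n_r \cos(C_r, C) \\
\mathcal{H}_1 &: \max \frac{\mathcal{I}_1}{\mathcal{E}_1}, \qquad
\mathcal{H}_2 : \max \frac{\mathcal{I}_2}{\mathcal{E}_1}
\end{align*}
```

The internal functions reward clusters whose documents are similar to each other or to their own centroid, the external function penalizes clusters whose centroids sit close to the collection centroid, and the hybrid functions trade the two off as a ratio.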
12
Experiments
- SSE for I (K-means vs. bisecting K-means)
13
Visualization
What we used:
- jQuery: JavaScript library for DOM manipulation and Ajax requests
- PHP/MySQL: scripting language and database backend
- Google Visualization API: time series graph, zoomable
- Timepedia Chronoscope: clickable
14
Conclusions
- Success? Of course we think so
- Future Work
  - Save lives?
  - Better clustering
  - Cleaner data
  - More data
  - Make it scalable and dynamic
  - On-line and on the fly?