Using Digg to Save Lives, or, Using Digg to Find Interesting Events. Presented by: Luis Zaman, Amir Khakpour, and John Felix.

Outline
• Introduction
• Motivation and Approach
• Preprocessing
• Clustering
• Document Clustering
• Results
• Visualization Tools
• Demo!

Explanation
• Digg is a social web-media discovery tool based on user-submitted content.
  - 1 or 2 submissions a minute
  - Half-life of “interest” is about a day
• Digg aggregates “interesting” content.
• But how do we find interesting events and know their themes?

Motivation
• The collaborative nature of social media can scour the WWW very thoroughly.
• But this generates A LOT of data (you’ll see).
• It would be cool to find emergencies or critical situations based on this collaborative media.
• Digg's Apple topic seems like a pretty good starting point.

Approach
• Get Digg time series data for 3 months
• Cluster Digg stories
• Visualize the time series
• Show hot “topics” for a clicked point in the graph
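The last step of the approach (hot topics for a clicked point) can be sketched as a lookup that sums the diggs per cluster for one hour. This is only an illustration: the dictionary layouts, the function name `hot_clusters`, and the cluster labels are hypothetical and are not taken from the presentation's PHP/MySQL backend.

```python
from collections import Counter

def hot_clusters(hour, diggs_by_hour, cluster_of, top_n=3):
    """Return the top_n (cluster, diggs) pairs for one clicked hour.

    diggs_by_hour: {story_id: {hour_index: digg_count}}  (hypothetical layout)
    cluster_of:    {story_id: cluster_label}             (output of the clustering step)
    """
    totals = Counter()
    for story_id, series in diggs_by_hour.items():
        count = series.get(hour, 0)
        if count:
            totals[cluster_of[story_id]] += count
    return totals.most_common(top_n)

# Toy data: three stories, two clusters.
diggs_by_hour = {
    "s1": {0: 5, 1: 40},
    "s2": {1: 10},
    "s3": {1: 25, 2: 3},
}
cluster_of = {"s1": "iphone", "s2": "macbook", "s3": "iphone"}

print(hot_clusters(1, diggs_by_hour, cluster_of))
# [('iphone', 65), ('macbook', 10)]
```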

Preprocessing
• Digg API
  - REST API
  - http://services.digg.com/stories/topic/apple?count=10
  - XML response
• Limitations
  - 100 results per request
  - 1 hour of time series data
  - Can’t go fast, or else.
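Working within the 100-results cap and the "can't go fast" limit roughly means paging through results and pausing between requests. The sketch below only builds the request URLs and spaces them out; it does not contact the service. The endpoint and `count` come from the slide, while the `offset` paging parameter, the helper names, and the one-second delay are assumptions to be checked against the API documentation.

```python
import time
from urllib.parse import urlencode

BASE = "http://services.digg.com/stories/topic/{topic}"  # endpoint shown on the slide
MAX_COUNT = 100  # per-request cap stated on the slide

def page_urls(topic, total, page_size=MAX_COUNT):
    """Yield request URLs covering `total` stories in pages of at most page_size.

    `offset` is an assumed paging parameter, not confirmed by the slides.
    """
    page_size = min(page_size, MAX_COUNT)
    for offset in range(0, total, page_size):
        count = min(page_size, total - offset)
        query = urlencode({"count": count, "offset": offset})
        yield BASE.format(topic=topic) + "?" + query

def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items, pausing between them so requests are not sent too fast."""
    for i, item in enumerate(items):
        if i:
            sleep(delay_seconds)
        yield item

# Three pages are needed for 250 stories: 100 + 100 + 50.
for url in throttled(page_urls("apple", 250), delay_seconds=0.0):
    print(url)
```

In a real crawl each URL would be fetched and its XML response parsed before moving to the next page; the delay would be tuned to whatever rate the service tolerates.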

Preprocessing
• Time Series
  - Each digg is the event (only 100 at a time)
  - Rows: each story’s digg count
  - Columns: every hour (2,207 of them from August 08 – November 08)
• Clustering
  - Rows: each story that was dugg at any point in the time series
  - Columns: the words in the title and description of this story
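The time series layout described above (one row per story, one column per hour) can be sketched by binning individual digg timestamps into hourly buckets. The input format, the function names, and the start date used in the example are assumptions for illustration only.

```python
from datetime import datetime

def hour_index(ts, start):
    """Number of whole hours between start and ts (negative if ts is earlier)."""
    return int((ts - start).total_seconds() // 3600)

def build_series(diggs, start, n_hours):
    """Bin digg events into hourly counts.

    diggs: iterable of (story_id, datetime) pairs (hypothetical input format)
    Returns {story_id: [count for hour 0, count for hour 1, ...]}.
    """
    series = {}
    for story_id, ts in diggs:
        h = hour_index(ts, start)
        if 0 <= h < n_hours:  # ignore events outside the window
            row = series.setdefault(story_id, [0] * n_hours)
            row[h] += 1
    return series

start = datetime(2008, 8, 1)  # illustrative start of the window
diggs = [
    ("s1", datetime(2008, 8, 1, 0, 15)),
    ("s1", datetime(2008, 8, 1, 0, 50)),
    ("s1", datetime(2008, 8, 1, 2, 5)),
    ("s2", datetime(2008, 8, 1, 1, 30)),
]
series = build_series(diggs, start, n_hours=3)
print(series)
# {'s1': [2, 0, 1], 's2': [0, 1, 0]}
```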

Preprocessing - Challenges
• SLOW
• Really dirty data
• Different formats of data
• REALLY SLOW

Introduction to Document Clustering
• Unlike structured data, clustering text documents poses several challenges:
  - Volume
  - Dimensionality
  - Sparsity
  - Complex semantics
• In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM).
  - Huge sparse matrix; we store only the non-zero values
• Text documents are converted to a matrix A of size m × n, for m documents and a total of n words (or phrases); each element x_{i,j} is the frequency of the j-th term in the i-th document.
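A minimal sketch of the Vector Space Model described above, storing only the non-zero term frequencies. The tokenizer, the dictionary-per-row storage, and the function names are illustrative choices, not the presenters' implementation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vsm(documents):
    """Return (rows, vocab): rows[i] maps term index j to the frequency x_{i,j}."""
    vocab = {}  # term -> column index j
    rows = []   # one sparse row per document
    for text in documents:
        row = {}
        for term, freq in Counter(tokenize(text)).items():
            j = vocab.setdefault(term, len(vocab))  # new terms get the next index
            row[j] = freq
        rows.append(row)
    return rows, vocab

def density(rows, n_terms):
    """Fraction of the m x n matrix that is non-zero."""
    nonzero = sum(len(row) for row in rows)
    return nonzero / (len(rows) * n_terms)

docs = ["Apple unveils new iPhone", "New MacBook from Apple", "iPhone sales rise"]
rows, vocab = build_vsm(docs)
print(len(docs), len(vocab), density(rows, len(vocab)))
```

The same ratio applied to a matrix with 25,470 rows, 55,557 columns, and 469,323 non-zero entries comes out at roughly 0.033%, which shows how little of such a matrix needs to be stored.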

Clustering
• Dataset
  - Number of stories (m): 25,470
  - Total number of unique words (n): 55,557
  - Nonzero values: 469,323 (0.03214%)
• Clustering using the CLUTO software
  - Using K-means and bisecting K-means
• Calculating centroids and SSE
  - A C++ program is run on “black”
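For reference, centroids and the sum of squared errors (SSE) for a given cluster assignment can be computed as below. This is a small dense-vector sketch in Python for readability; it is not CLUTO's code or the presenters' C++ program, and real document vectors would be kept sparse.

```python
def centroids_and_sse(points, labels):
    """Compute per-cluster centroids and the total SSE.

    points: list of equal-length numeric lists
    labels: cluster id for each point, in the same order
    """
    sums, counts = {}, {}
    for p, c in zip(points, labels):
        acc = sums.setdefault(c, [0.0] * len(p))
        for k, v in enumerate(p):
            acc[k] += v
        counts[c] = counts.get(c, 0) + 1

    # Centroid = component-wise mean of the points in the cluster.
    centroids = {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

    # SSE = squared Euclidean distance of every point to its own centroid.
    sse = 0.0
    for p, c in zip(points, labels):
        sse += sum((v - m) ** 2 for v, m in zip(p, centroids[c]))
    return centroids, sse

points = [[0, 0], [0, 2], [4, 4], [6, 4]]
labels = [0, 0, 1, 1]
centroids, sse = centroids_and_sse(points, labels)
print(centroids, sse)
# {0: [0.0, 1.0], 1: [5.0, 4.0]} 4.0
```

A lower SSE for the same number of clusters indicates tighter clusters, which is what makes it usable for comparing K-means against bisecting K-means.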

Document Clustering by Optimizing Criterion Functions
• According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it over the possible clusterings:
  - Internal criterion functions (I): maximize the internal similarity function
  - External criterion functions (E): minimize the external similarity function
  - Hybrid criterion functions (H): maximize a combination of the two
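The formulas on this slide did not survive in the transcript, so which variants were used is not recoverable. As an illustration only, the sketch below evaluates two forms commonly cited from Zhao and Karypis's work on criterion functions: an internal function I2 (the sum over clusters of the norm of the composite vector D_r) and an external function E1 (the sum over clusters of n_r · D_r·D / ‖D_r‖, where D is the composite of the whole collection). Treat both definitions, and all function names, as assumptions here.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def composite_vectors(docs, labels):
    """Return (D_r per cluster, n_r per cluster, D for the whole collection)."""
    unit = [normalize(d) for d in docs]
    comps, sizes = {}, {}
    for d, c in zip(unit, labels):
        acc = comps.setdefault(c, [0.0] * len(d))
        for k, x in enumerate(d):
            acc[k] += x
        sizes[c] = sizes.get(c, 0) + 1
    total = [sum(col) for col in zip(*unit)]
    return comps, sizes, total

def criterion_i2(docs, labels):
    """Internal criterion (assumed form): sum of ||D_r||; larger is better."""
    comps, _, _ = composite_vectors(docs, labels)
    return sum(math.sqrt(sum(x * x for x in D)) for D in comps.values())

def criterion_e1(docs, labels):
    """External criterion (assumed form): sum of n_r * (D_r . D) / ||D_r||; smaller is better.

    Assumes every cluster has a non-zero composite vector.
    """
    comps, sizes, total = composite_vectors(docs, labels)
    value = 0.0
    for c, D in comps.items():
        norm = math.sqrt(sum(x * x for x in D))
        value += sizes[c] * sum(a * b for a, b in zip(D, total)) / norm
    return value

docs = [[1, 0], [2, 0], [0, 1], [0, 3]]
good = [0, 0, 1, 1]  # groups documents that share a term
bad = [0, 1, 0, 1]   # mixes them

print(criterion_i2(docs, good), criterion_i2(docs, bad))
print(criterion_e1(docs, good), criterion_e1(docs, bad))
```

On this toy data the sensible grouping scores higher on the internal function and lower on the external one; a hybrid function would reward both at once, for example by taking their ratio.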

Experiments
• SSE for I (K-Means vs Bisecting K-Means)

Visualization
• What we used
  - jQuery: JavaScript library for DOM manipulation and Ajax requests
  - PHP/MySQL: scripting language and database backend
  - Google Visualization API: time series graph, zoomable
  - Timepedia Chronoscope: clickable

Conclusions
• Success?
  - Of course we think so
• Future Work
  - Save lives?
  - Better clustering
  - Cleaner data
  - More data
  - Make it scalable and dynamic
  - On-line and on the fly?
