Using to Save Lives Or, Using Digg to find interesting events. Presented by: Luis Zaman, Amir Khakpour, and John Felix.

Presentation transcript:

Outline
- Introduction
- Motivation and Approach
- Preprocessing
- Clustering
- Document Clustering
- Results
- Visualization Tools
- Demo!

Explanation
- Digg is a social web-media discovery tool based on user-submitted content.
  - 1 or 2 submissions a minute
  - Half-life of “interest” is about a day
- Digg aggregates “interesting” content.
- But how do we find interesting Events and know their Themes?

Motivation
- The collaborative nature of Social Media can scour the WWW very thoroughly.
- But this generates A LOT of data (you’ll see).
- It would be cool to find emergencies or critical situations based on this collaborative media.
- Apple seems like a pretty good starting point.

Approach
- Get Digg time series data for 3 months
- Cluster Digg stories
- Visualize the time series
- Show hot “topics” for a clicked point in the graph

Preprocessing  Digg API  REST API   XML response   Limitations  100 results per request  1 Hour of time series data  Can’t go fast, or else.

Preprocessing  Time Series  Each digg is the event (only 100 at a time)  Rows  Each story’s digg count  Columns  Every hour (2,207 of them from August 08 – November 08)  Clustering  Rows  Each story that was digged at any point in the time series  Columns  The words in the title and description of this story

Preprocessing - Challenges
- SLOW
- Really Dirty Data
- Different Formats of Data
- REALLY SLOW

Introduction to Document Clustering
- Unlike structured data, text documents are hard to cluster because of:
  - Volume
  - Dimensionality
  - Sparsity
  - Complex semantics
- In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM).
- Text documents are converted to an m × n matrix A, where m is the number of documents and n is the total number of words (or phrases); each element x_{i,j} is the frequency of the j-th term in the i-th document.
- The matrix is huge and sparse, so we store only the non-zero values.
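A small sketch of the Vector Space Model representation described above, storing only non-zero term frequencies. The tokenizer (lowercased alphanumeric runs) and the dict-per-row storage are assumptions chosen for brevity; the presentation does not say how its matrix was built or stored.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_sparse_tf(docs: List[str]) -> Tuple[List[Dict[int, int]], Dict[str, int]]:
    """Return (rows, vocab).

    rows[i] maps column index j to x_{i,j}, the frequency of term j in
    document i. Zero entries are simply absent from the dict.
    """
    vocab: Dict[str, int] = {}
    rows: List[Dict[int, int]] = []
    for doc in docs:
        row: Dict[int, int] = {}
        for term, freq in Counter(tokenize(doc)).items():
            j = vocab.setdefault(term, len(vocab))  # assign next free column
            row[j] = freq
        rows.append(row)
    return rows, vocab


docs = ["Apple releases new iPhone", "New apple rumor: apple tablet"]
rows, vocab = build_sparse_tf(docs)
nonzero = sum(len(row) for row in rows)
density = nonzero / (len(docs) * len(vocab))  # fraction of non-zero cells
```

Even in this two-document toy example a third of the cells are zero; with thousands of stories and a large vocabulary the non-zero fraction becomes very small, which is why only those values are stored.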

Clustering  Dataset  Number of stories (m) :  Total number of unique words (n):  Nonzero values: ( %)  Clustering using Cluto Software  Using Kmeans, bisecting Kmeans  Calculating Centroids and SSE  A C++ program is run on “black”

Document Clustering by Optimizing Criterion Functions
- According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it:
  - Internal Criterion Functions (I)
    - Maximizing the internal similarity function:
  - External Criterion Functions (E)
    - Minimizing the external similarity function:
  - Hybrid Criterion Functions (H)
    - Maximizing
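For reference, the criterion functions in Zhao and Karypis's work on document clustering are commonly written as below. Which variants this slide showed is not recoverable from the transcript, so treat the selection as an assumption. Here S_r is the r-th of k clusters, n_r its size, C_r its centroid, and C the centroid of the whole collection.

```latex
% Internal criterion functions (maximized)
\mathcal{I}_1 = \sum_{r=1}^{k} \frac{1}{n_r} \sum_{d_i, d_j \in S_r} \cos(d_i, d_j)
\qquad
\mathcal{I}_2 = \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r)

% External criterion function (minimized)
\mathcal{E}_1 = \sum_{r=1}^{k} n_r \cos(C_r, C)

% Hybrid criterion functions (maximized)
\mathcal{H}_1 = \frac{\mathcal{I}_1}{\mathcal{E}_1}
\qquad
\mathcal{H}_2 = \frac{\mathcal{I}_2}{\mathcal{E}_1}
```

The internal functions reward documents that are similar to their own cluster, the external function penalizes clusters whose centroids resemble the collection as a whole, and the hybrid functions combine the two as a ratio.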

Experiments  SSE for I (K-Means vs Bisecting K-Means)

Visualization  What we used  jQuery  Database query library for javascript  PHP/MySQL  Scripting language and database backend  Google Visualization API  Time Series Graph  Zoomable  Timepedia Chronoscope  Clickable

Conclusions  Success?  Of course we think so  Future Work  Save lives?  Better clustering  Cleaner data  More data  Make it scalable, and dynamic  On-line and on the fly?