Event Detection from Social Media: User-centric Parallel Split-n-merge and Composite Kernel
Truc-Vien T. Nguyen, Lugano University, Switzerland
Minh-Son Dao, University of Information Technology, Vietnam
Riccardo Mattivi, Trento University, Italy
Francesco G.B. De Natale, Trento University, Italy
SEWM – ICMR – 2014, Glasgow, UK, April 2014
Outline
- Social Event and Web-media
- User-centric Parallel Split-n-merge for Events Clustering
- Composite Kernel for Event Classification
- Ongoing work
- Conclusion
Tsunami - Miyagi, Japan - Mar 11, 2011
Observations
- Time-Location: a user cannot attend two events at the same time if their locations are far apart
- Theme: users in the same community tend to tag the same event with similar words
- Users tend to take a series of images within a short time interval of whatever holds their attention
- Images related to an event of a given type share common visual features that are characteristic of that event type
Spatio-Temporal-Theme
User-centric Parallel Split-n-merge
1. Web media collection A crawled from social networks
2. Convert A to UT-image
3. Split each row of the UT-image into clusters {b_i}
4. Merge {b_i} using {location, time, theme}
5. Merge {b_i} using {location, time, theme} and common sense
6. Merge {b_i} using visual information
UT-Image
Metadata fields: photo_url, username, dateTaken, title, description, tags, locations
Axes: users, time
Each row is sorted by time. Pixels in the same row that have no time information are grouped and placed at the beginning of the row.
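The row ordering described on this slide can be sketched as follows. This is a minimal illustration, assuming one row per user and one "pixel" per photo, with `username` and `dateTaken` taken from the metadata fields listed above. The function name `build_ut_image` and the dictionary layout are hypothetical, not the authors' implementation.

```python
from collections import defaultdict
from datetime import datetime


def build_ut_image(photos):
    """Group photos by user and sort each row by time, untimed photos first."""
    rows = defaultdict(list)
    for photo in photos:
        rows[photo["username"]].append(photo)
    for row in rows.values():
        # Key is (has_time, time): False sorts before True, so photos
        # without a timestamp are grouped at the beginning of the row.
        row.sort(key=lambda p: (p["dateTaken"] is not None,
                                p["dateTaken"] or datetime.min))
    return dict(rows)


photos = [
    {"username": "u1", "dateTaken": datetime(2013, 5, 2), "photo_url": "b"},
    {"username": "u1", "dateTaken": None, "photo_url": "c"},
    {"username": "u1", "dateTaken": datetime(2013, 5, 1), "photo_url": "a"},
    {"username": "u2", "dateTaken": None, "photo_url": "d"},
]
ut_image = build_ut_image(photos)
```

Because each row depends only on one user's photos, the rows can be processed independently, which fits the parallel design the deck describes.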
Split by TIME
- If there is no time information, each pixel is treated as one cluster
- If there is time information
Merge by spatio-time-theme
For each selected cluster b_k, create:
- Time-taken boundary T_k
- Location union L_k
- Document (tag, title, description) D_k
For any pair of clusters (b_k, b_l), merge if two of the following three conditions hold:
- Tdistance(T_k, T_l) ≤ α
- Ldistance(L_k, L_l) ≤ β
- JaccardIndex(D_k, D_l) ≥ γ
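The two-out-of-three merge test can be sketched as below. The slide does not define Tdistance or Ldistance, so this sketch assumes the gap between two time intervals and the smallest great-circle (haversine) distance between two point sets, and it reads "2/3" as "at least two". The dictionary keys `T`, `L`, `D` and all function names are hypothetical. The default thresholds are the first-run values reported in the deck.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def time_distance(t_k, t_l):
    """Gap between two (start, end) intervals; zero if they overlap (assumed)."""
    gap = max(t_k[0], t_l[0]) - min(t_k[1], t_l[1])
    return max(gap, timedelta(0))


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def location_distance(l_k, l_l):
    """Smallest distance between any two points; inf if either set is empty (assumed)."""
    return min((haversine_km(p, q) for p in l_k for q in l_l), default=float("inf"))


def should_merge(b_k, b_l, alpha=timedelta(hours=24), beta=5.0, gamma=0.2):
    """Merge when at least two of the three conditions hold."""
    votes = [
        time_distance(b_k["T"], b_l["T"]) <= alpha,
        location_distance(b_k["L"], b_l["L"]) <= beta,
        jaccard(b_k["D"], b_l["D"]) >= gamma,
    ]
    return sum(votes) >= 2


a = {"T": (datetime(2013, 5, 1, 10), datetime(2013, 5, 1, 12)),
     "L": [(46.07, 11.12)], "D": {"concert", "trento", "rock"}}
b = {"T": (datetime(2013, 5, 1, 20), datetime(2013, 5, 1, 22)),
     "L": [(46.07, 11.13)], "D": {"football", "match"}}
c = {"T": (datetime(2013, 6, 1, 10), datetime(2013, 6, 1, 12)),
     "L": [(55.86, -4.25)], "D": {"concert", "rock"}}
```

Here `a` and `b` are close in time and place but share no words, so two conditions hold and they merge. `a` and `c` agree only on theme, so they stay apart.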
Merge by common-sense
- Compute tf-idf on D_k and select the most common keywords to create ND_k
- For any pair of clusters (b_k, b_l), merge if JaccardIndex(ND_k, ND_l) ≥ γ
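A minimal sketch of the keyword step, assuming a plain tf-idf variant (relative term frequency times log inverse document frequency, computed across the cluster documents) and assuming that "most common" means the n highest-scoring terms. The slide specifies neither choice. The names `top_keywords` and `n` are hypothetical.

```python
from collections import Counter
from math import log


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_keywords(documents, n=3):
    """Return one keyword set ND_k per cluster document (a list of tokens)."""
    doc_freq = Counter(term for doc in documents for term in set(doc))
    total = len(documents)
    keyword_sets = []
    for doc in documents:
        counts = Counter(doc)
        scores = {t: (c / len(doc)) * log(total / doc_freq[t])
                  for t, c in counts.items()}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keyword_sets.append(set(ranked[:n]))
    return keyword_sets


docs = [
    ["rock", "concert", "glasgow", "rock"],
    ["rock", "concert", "stage", "rock"],
    ["football", "match", "stadium"],
]
nd = top_keywords(docs, n=2)
merge_first_two = jaccard(nd[0], nd[1]) >= 0.2
```

With γ = 0.2, the first two clusters share the keyword "rock" and would be merged, while the third shares nothing with either.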
Merge with Visual features
For any pair of clusters (b_k, b_l), merge if JaccardIndex(BoW_k, BoW_l) ≥ θ
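One way to apply a pairwise merge test across many clusters is a union-find pass. In this sketch each BoW_k is represented as the set of visual-word ids present in cluster k. Both that representation and the transitive merging are assumptions: the slide only states the pairwise test. The function name `merge_by_visual_words` is hypothetical, and θ = 0.3 is the value from the last run.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_by_visual_words(bows, theta=0.3):
    """Group cluster indices whose visual-word sets pass the Jaccard test."""
    parent = list(range(len(bows)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(bows)):
        for j in range(i + 1, len(bows)):
            if jaccard(bows[i], bows[j]) >= theta:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(bows)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


bows = [{1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11}, {5, 6, 7, 8}]
groups = merge_by_visual_words(bows)
```

Note that transitive merging can chain clusters together: if k matches l and l matches m, all three end up in one group even when k and m do not pass the test directly.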
Results – Events clustering
MediaEval 2013 dataset and participants
Result - Events Clustering
- The first run (split, merge by spatio-time-theme): α = 24 hours, β = 5 km, γ = 0.2
- The second run (as the first): α = 8 hours, β = 2 km, γ = 0.2
- The third run: as the first, plus common-sense merging
- The last run: as the third, plus visual features, θ = 0.3
Classification Problems
- Supervised learning: learn a function $f: X \to Y$ from examples
- Binary classification: $Y = \{-1, +1\}$
- Multi-class classification: $Y = \{1, 2, \dots, k\}$
- Event classification: each member of $X$ has a set of features
SVM – Multi-class Classification
- Support Vector Machines (SVMs): binary classification
- Computing a function (kernel) between each pair of samples
- One-vs-Rest multi-class classification
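The one-vs-rest scheme reduces a multi-class problem to one binary problem per class. The sketch below shows only the relabelling and the final decision rule, and assumes that each binary classifier returns a real-valued decision score. Training the SVMs themselves is left out, and both function names are hypothetical.

```python
def one_vs_rest_targets(labels, classes):
    """For each class c, relabel the data as +1 (is c) or -1 (is not c)."""
    return {c: [1 if y == c else -1 for y in labels] for c in classes}


def predict_one_vs_rest(scores):
    """Pick the class whose binary classifier gives the largest decision score."""
    return max(scores, key=scores.get)


# Class ids follow the event-category numbering (0 Conference, 2 Concert, 3 Non_event).
targets = one_vs_rest_targets([0, 2, 2, 3], classes=[0, 2, 3])
predicted = predict_one_vs_rest({0: -0.4, 2: 1.3, 3: 0.2})
```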
Event Categories
Class  Event Type
0      Conference
1      Fashion
2      Concert
3      Non_event
4      Sports
5      Protest
6      Other
7      Exhibition
8      Theater_dance
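Composite Kernel
$CK(E_i, E_j) = \alpha \cdot K_T(E_i, E_j) + (1 - \alpha) \cdot K_V(E_i, E_j)$
where $K_T$ is the kernel on text features, $K_V$ is the kernel on visual features, and $\alpha$ is the coefficient.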
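A composite kernel of this kind is commonly built as a weighted sum of a text kernel and a visual kernel. The sketch below assumes a convex combination with a coefficient `alpha` in [0, 1] applied to the text kernel. The two component kernels (a sparse dot product for text, histogram intersection for visual features) and the event layout with `text` and `visual` keys are illustrative assumptions, not the authors' code.

```python
def text_kernel(e_i, e_j):
    """Dot product of sparse term-weight dictionaries (assumed text kernel)."""
    t_i, t_j = e_i["text"], e_j["text"]
    return sum(w * t_j[term] for term, w in t_i.items() if term in t_j)


def visual_kernel(e_i, e_j):
    """Histogram intersection of two equal-length visual histograms."""
    return sum(min(a, b) for a, b in zip(e_i["visual"], e_j["visual"]))


def composite_kernel(k_text, k_visual, alpha):
    """Return a kernel that mixes k_text and k_visual with weight alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    def ck(e_i, e_j):
        return alpha * k_text(e_i, e_j) + (1.0 - alpha) * k_visual(e_i, e_j)

    return ck


e1 = {"text": {"rock": 1.0, "concert": 1.0}, "visual": [0.5, 0.5, 0.0]}
e2 = {"text": {"rock": 1.0}, "visual": [0.25, 0.25, 0.5]}
ck = composite_kernel(text_kernel, visual_kernel, alpha=0.5)
value = ck(e1, e2)
```

A non-negative weighted sum of two valid kernels is itself a valid kernel, so the combined function can be passed to an SVM in place of either component.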
Text Features
- NLP basic features: the word, its lower-case form, four prefixes, four suffixes, orthographic feature, word-form feature
- Ontological features: obtained by matching w_i against a knowledge base, e.g. “Washington” -> City
- Encyclopedic features: obtained by associating w_i with Wikipedia, e.g. “Washington” ->
An excerpt from the ontology
Visual Features
- Dense RGB-SIFT
- SVM with histogram intersection kernel
- The SVMs were trained with the images given in the SED training set
- Codebook for the bag of words with 4096 visual words
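The histogram intersection kernel sums the bin-wise minimum of two histograms. The sketch below pairs it with a simple bag-of-words histogram builder. The L1 normalisation and both function names are assumptions; the default codebook size of 4096 comes from the slide.

```python
def bow_histogram(word_ids, codebook_size=4096):
    """Count visual-word occurrences and L1-normalise (normalisation assumed)."""
    hist = [0.0] * codebook_size
    for w in word_ids:
        hist[w] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total else hist


def histogram_intersection(h1, h2):
    """Histogram intersection kernel: sum of bin-wise minima."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


# A tiny 4-word codebook keeps the example readable.
h1 = bow_histogram([0, 0, 1, 3], codebook_size=4)
h2 = bow_histogram([0, 1, 1, 2], codebook_size=4)
similarity = histogram_intersection(h1, h2)
```

For L1-normalised histograms the kernel value lies between 0 (no shared visual words) and 1 (identical histograms).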
Results – Events Classification
Run with test-set cross-validation on the training set
Ongoing work
Diagram labels: Web media, Events clustering, clusters, Training data, Events classification, classifiers, events, name
- Set of instances of events
- Ability to annotate events automatically
- Extend to “automatic image annotation”
- Topic modeling (applied to the set of documents D_k)
- Improve event clustering quality
Conclusion
1. Event clustering
- Simple and easy to develop
- Can be developed to run in parallel mode
- Need to find a way to adjust parameters automatically
2. Event classification
- The composite kernel combines both text and visual features
- The combination has proved its robustness with a significant improvement in performance (from 45.83% to 53.58% with basic features, and from 47.61% to 54.86% with our new features)
- Encyclopedic knowledge, such as Wikipedia, could provide a great additional resource
Thanks for your attention
Q & A
Features
- w_i is the text of the title, description, or tag in each event
- l_i is the word w_i in lower case
- p1_i, p2_i, p3_i, p4_i are the four prefixes of w_i
- s1_i, s2_i, s3_i, s4_i are the four suffixes of w_i
- f_i is the part-of-speech of w_i
- g_i is the orthographic feature that tests whether a word is all upper-case, has an upper-case initial letter, or is all lower-case
- k_i is the word-form feature that tests whether a token is a word, a number, a symbol, or a punctuation mark
- o_i is the ontological feature. We used an ontology and knowledge base that contains 355 classes, 99 properties, and more than 100,000 entities. Given the full ontology, w_i is matched to the deepest subsumed child class.
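The surface features listed above can be computed with plain string operations. This sketch assumes the four prefixes and suffixes are the first and last 1 to 4 characters of the word, and uses a small hand-picked punctuation set to separate punctuation from symbols. The part-of-speech (f_i) and ontological (o_i) features need an external tagger and knowledge base and are left out. All function names and label strings are hypothetical.

```python
PUNCTUATION = set(".,;:!?'\"()[]{}-")


def orthographic(word):
    """g_i: capitalisation pattern of the word."""
    if word.isupper():
        return "ALL_UPPER"
    if word[:1].isupper():
        return "INIT_UPPER"
    if word.islower():
        return "ALL_LOWER"
    return "OTHER"


def word_form(token):
    """k_i: coarse token type."""
    if token.isalpha():
        return "word"
    if token.isdigit():
        return "number"
    if token and all(ch in PUNCTUATION for ch in token):
        return "punctuation"
    if token and not any(ch.isalnum() for ch in token):
        return "symbol"
    return "other"


def basic_features(word):
    """Build w, l, p1..p4, s1..s4, g and k for a single token."""
    feats = {"w": word, "l": word.lower(),
             "g": orthographic(word), "k": word_form(word)}
    for n in range(1, 5):
        feats[f"p{n}"] = word[:n]   # prefix of length n (assumed definition)
        feats[f"s{n}"] = word[-n:]  # suffix of length n (assumed definition)
    return feats


features = basic_features("Washington")
```

For tokens shorter than four characters, the longer prefixes and suffixes simply equal the whole token.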