Geotagging Social Media Content with a Refined Language Modelling Approach. Georgios Kordopatis-Zilos, Symeon Papadopoulos, and Yiannis Kompatsiaris.

Presentation transcript:

Geotagging Social Media Content with a Refined Language Modelling Approach
Georgios Kordopatis-Zilos, Symeon Papadopoulos, and Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)
PAISI 2015, May 19, 2015, Ho Chi Minh City, Vietnam

Where is it? #2
Depicted landmark: Eiffel Tower
Location: Paris, Tennessee
The keyword “Tennessee” is very important for placing the photo correctly.
Source (Wikipedia): er_(Paris,_Tennessee)

The Problem #3
A lot of multimedia content is associated with geographic information.
Being able to collect and analyze large amounts of geotagged content could be very useful for several applications, e.g., situational awareness in incidents such as natural disasters, verification, and geographic trend analysis.
Yet only a very small percentage of Web media content carries explicit location information (i.e. GPS coordinates); for instance, ~1% of tweets are geotagged.
Methods that can infer the geographic location of Web multimedia content are therefore of interest.

A Refined Language Model for Geotagging #4
Extend and improve the widely used Language Model for the problem of location estimation from text metadata.
The proposed improvements include:
–Feature selection based on a cross-validation approach
–Feature weighting based on spatial entropy
–Multiple resolution grids
Extensive evaluation on a public benchmark shows highly competitive performance and reveals new insights and challenges.

Related Work: Gazetteer-based #5
Methods that use large dictionaries and Volunteered Geographic Information (VGI), e.g., Geonames, Yahoo! GeoPlanet, OpenStreetMap, etc.
–Semantics-based IR approach for integrating gazetteers and VGI (Kessler et al., 2009)
–Similarity matching mediating multiple gazetteers in a meta-gazetteer service (Smart et al., 2010)
–Comma groups extracted with heuristic methods from lists of toponyms (Lieberman et al., 2010)

Related Work: Language Models #6
Language Models use large corpora of geotagged text to build location-specific language models, i.e. to capture the most frequent keywords for a given location.
–Base approach (Serdyukov et al., 2009)
–Disjoint dynamically sized cells (Hauff et al., 2012)
–User frequency instead of term frequency (O’Hare & Murdock, 2012)
–Clustering, use of χ² for feature selection, and similarity search (Van Laere et al., 2011)

Related Work: Multimodal Approaches #7
Multimodal approaches use not only text but also the visual content and other social metadata of geotagged multimedia.
–Combination of text metadata and visual content at two levels of granularity, city-level (100km) and landmark-level (100m) (Crandall et al., 2009)
–User models built from the user’s upload history, SN data and hometown (Trevisiol et al., 2013)
–Hierarchical approach using both text-based and visual similarity (Kelm et al., 2011)

Related Work: MediaEval Placing Task #8
Yearly benchmarking task where different approaches compete
–Each participant can submit up to 5 runs with different instances/configurations of their method
Dataset for Placing Task 2014
–Flickr CC-licensed images & videos, subset of YFCC 100M
–Training: 5M, Testing: 510K (multiple subsets of increasing size are used for reporting)
Evaluation
–The estimated location of each test image/video is compared against the known one, and it is checked whether it falls within a circle of radius 10m, 100m, 1km, 10km, 100km and 1000km around it
–The percentage of images/videos that were correctly placed within each radius is then reported
Competing approaches: both gazetteer- and LM-based
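
The evaluation measure described on this slide can be sketched as follows. This is only an illustration, not the benchmark's official tooling: it assumes a haversine great-circle distance on a spherical Earth of radius 6371 km, and the names `haversine_km` and `precision_at_radii` are hypothetical.

```python
import math

# Evaluation radii from the slide, in kilometres: 10m, 100m, 1km, 10km, 100km, 1000km.
RADII_KM = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumption: spherical Earth, radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))


def precision_at_radii(estimated, actual, radii_km=RADII_KM):
    """Fraction of items whose estimated (lat, lon) lies within each radius of the true one."""
    distances = [haversine_km(e[0], e[1], a[0], a[1]) for e, a in zip(estimated, actual)]
    return {r: sum(d <= r for d in distances) / len(distances) for r in radii_km}
```

Each value in the returned dictionary is the share of test items placed within the corresponding radius, which is the quantity the task reports per range.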

Overview of Approach #9

Geographic Language Model (1/2) #10
Training data: corpus D_tr of images and videos
Test data: corpus D_ts
For each item (in either the training or the test data), we have: user id, title, tags, description.
The title and tags of training images are used for building the model. For testing, the description is used only if the item has neither a title nor tags associated with it.
Pre-processing: punctuation and symbol removal, lowercasing, removal of numeric tags, and splitting of composite phrases into their components (e.g. “new+york” → “new”, “york”)
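
The pre-processing steps listed on this slide could look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' code: the function name `preprocess_tags` is hypothetical, "+" is taken as the only composite-phrase separator (as in the slide's example), and symbols inside a tag are simply deleted rather than treated as separators.

```python
import re


def preprocess_tags(raw_tags):
    """Lowercase, split composite phrases, strip symbols, and drop numeric tags."""
    terms = []
    for tag in raw_tags:
        # Split composite phrases such as "new+york" into their components.
        for part in tag.lower().split("+"):
            # Remove punctuation and symbols (anything that is not a letter or digit).
            cleaned = re.sub(r"[\W_]", "", part)
            # Skip empty strings and purely numeric tags.
            if cleaned and not cleaned.isdigit():
                terms.append(cleaned)
    return terms
```

For example, `preprocess_tags(["New+York", "2014", "NYC"])` yields the terms `new`, `york` and `nyc`, with the numeric tag dropped.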

Geographic Language Model (2/2) #11

Geographic Language Model: Example #12 new: 0.15 york: 0.27 manhattan: 0.45 liberty: 0.33 … nyc: 0.52
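
The formulas on the language model slides are not preserved in this transcript, so the following is only a sketch of one common formulation, given as an assumption: the probability of a term in a cell is the number of distinct users who used it there divided by the number of such user-cell uses over all cells, and an item is assigned to the cell with the highest sum of term-cell probabilities. All function names are hypothetical.

```python
import math
from collections import defaultdict


def cell_id(lat, lon, size=0.01):
    """Map coordinates to a grid cell with sides of `size` degrees."""
    return (math.floor(lat / size), math.floor(lon / size))


def build_language_model(training_items, size=0.01):
    """training_items: iterable of (user_id, terms, lat, lon).

    Returns {term: {cell: probability}}; for each term the probabilities sum to 1.
    """
    users = defaultdict(lambda: defaultdict(set))
    for user_id, terms, lat, lon in training_items:
        cell = cell_id(lat, lon, size)
        for term in set(terms):
            users[term][cell].add(user_id)
    model = {}
    for term, cells in users.items():
        total = sum(len(u) for u in cells.values())
        model[term] = {cell: len(u) / total for cell, u in cells.items()}
    return model


def most_likely_cell(model, terms):
    """Return (best_cell, score), where score is the sum of term-cell probabilities."""
    scores = defaultdict(float)
    for term in set(terms):
        for cell, p in model.get(term, {}).items():
            scores[cell] += p
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]
```

The score returned alongside the cell is the sum of cell-tag probabilities, the quantity the placeability slide describes as a confidence indicator.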

Feature Selection #13

Feature Weighting using Spatial Entropy #14

Entropy Histogram & Gaussian Weighting #15
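
The slides on spatial entropy and Gaussian weighting lost their formulas in this transcript. The sketch below shows one plausible reading, stated as an assumption: a term's spatial entropy is the Shannon entropy of its probability distribution over cells, and each term is weighted by a Gaussian centred on the mean of all entropy values, with the standard deviation of those values as its width. The names and the normalisation to a peak of 1 are hypothetical choices.

```python
import math
from statistics import mean, pstdev


def spatial_entropy(cell_probs):
    """Shannon entropy of a term's probability distribution over cells."""
    return -sum(p * math.log(p) for p in cell_probs.values() if p > 0)


def gaussian_weights(model):
    """model: {term: {cell: probability}}. Returns {term: weight in (0, 1]}."""
    entropies = {term: spatial_entropy(cells) for term, cells in model.items()}
    mu = mean(entropies.values())
    # Fall back to 1.0 when all entropies are identical, to avoid dividing by zero.
    sigma = pstdev(entropies.values()) or 1.0
    raw = {t: math.exp(-((e - mu) ** 2) / (2 * sigma ** 2)) for t, e in entropies.items()}
    peak = max(raw.values())
    return {t: w / peak for t, w in raw.items()}
```

Under this reading, terms whose entropy is far from the typical value receive lower weights than terms near the centre of the entropy histogram; whether this matches the authors' exact weighting is not confirmed by the transcript.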

Similarity Search #16

Multiple Resolution Grids #17
To increase the granularity of the prediction and, at the same time, its reliability, we devised the following dual grid scheme:
–we build two LMs: one of size 0.01° x 0.01° (coarse granularity) and one of size ° x ° (fine granularity)
–we conduct location estimations based on both
–if the fine granularity estimation falls within the cell of the coarse granularity estimation, we select the fine granularity estimation
–otherwise, we select the coarse granularity estimation (since we consider it more reliable by default)
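
The selection rule of the dual grid scheme can be sketched as below. This is a minimal illustration under stated assumptions: each language model is assumed to return a (lat, lon) estimate, the coarse cell size of 0.01° is taken from the slide, and the function names are hypothetical. The fine cell size is not needed for the rule itself.

```python
import math


def cell_of(lat, lon, size):
    """Grid cell containing the point, for cells with sides of `size` degrees."""
    return (math.floor(lat / size), math.floor(lon / size))


def select_estimate(coarse_estimate, fine_estimate, coarse_size=0.01):
    """Prefer the fine estimate only when it falls inside the coarse estimate's cell."""
    if fine_estimate is not None:
        same_cell = cell_of(*fine_estimate, coarse_size) == cell_of(*coarse_estimate, coarse_size)
        if same_cell:
            return fine_estimate
    # Otherwise fall back to the coarse estimate, treated as the more reliable one.
    return coarse_estimate
```

The design keeps the coarse model as a consistency check: the fine model only refines a prediction the coarse model already agrees with.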

Evaluation #18
Benchmark dataset: MediaEval 2014
Training set: 5M, Test set: 510K
All experiments conducted on the full test set (510K)
Two stages of evaluation:
–participation in contest (with a limited version of the proposed approach)
–post-contest performance exploration

Evaluation: MediaEval 2014 Contest (1/4) #19
Out of the five runs, three were based on variations of the presented approach:
–run1: LM + feature weighting with spatial entropy + similarity search + multiple resolution grid
–run4: LM only
–run5: LM + similarity search
(similarity search parameters: α=1, k=4)
Performance was measured as the percentage of test items placed within distance r of their known location, where r = 10m, 100m, 1km, 10km, 100km and 1000km.

Evaluation: MediaEval 2014 Contest (2/4) #20
The proposed improvements (run1) outperform the base approach (run4) and the base approach with similarity search (run5).
The improvement is more pronounced in the small ranges (10m, 100m, 1km).

Evaluation: MediaEval 2014 Contest (3/4) #21 Proposed

Evaluation: MediaEval 2014 Contest (4/4) #22 Number of image tags

Post-Contest Evaluation #23
Explore the role of different factors:
–Big training set (YFCC100M): ~48M geotagged items
–Feature Selection (FS)
–Feature Weighting with Spatial Entropy (SE)
–Multiple Resolution Grid (MG)
–Similarity Search (SS)
Two settings:
–FAIR: All users from the training set are completely removed from the test set.
–OVERFIT: Users are not removed from the test set even when some of their media items are included in the training set.

Post-Contest Evaluation #24
Clear improvement with the addition of MG and SS.
The proposed improvements, together with the use of the bigger dataset, make the approach perform better than all other methods in MediaEval 2014.

Geographic Error Analysis #25
More data leads to lower error across the globe.
Several small US cities suffer from low accuracy because they share their names with large European cities.
Map panels: ALL + YFCC100M; run 4

Big Data vs. Complex Algorithms #26
Using 10x more training data led to performance equivalent to that of a more complex algorithm (LM + extensions) trained on less data!

Placeability of Media Items Sum of cell-tag probabilities is a good indicator of how confident we are in the decision of the classifier. #27

Complicated Cases #28 Statue of Liberty

Security Incident Heatmaps #29 earthquake riot

Conclusion #30
Key contributions
–Improved geotagging approach, extending the widely used language model in three ways: feature selection, weighting, multiple resolution grids
–Thorough analysis of geotagging accuracy offering new insights and highlighting new challenges
Future Work
–Exploit visual features to improve accuracy (currently visual-only approaches perform very poorly)
–Integrate gazetteer and structured location data sources (e.g. Foursquare venues, OpenStreetMap, etc.)
–Evaluate in more challenging settings and datasets (e.g. Twitter, Instagram)

References (1/2) #31
C. Kessler, K. Janowicz, and M. Bishr. An Agenda for the Next Generation Gazetteer: Geographic Information Contribution and Retrieval. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2009.
P. Smart, C. Jones, and F. Twaroch. Multi-source Toponym Data Integration and Mediation for a Meta-gazetteer Service. In Proceedings of the 6th International Conference on Geographic Information Science, GIScience10. Springer-Verlag, Berlin, Heidelberg, 2010.
M.D. Lieberman, H. Samet, and J. Sankaranarayanan. Geotagging: Using Proximity, Sibling, and Prominence Clues to Understand Comma Groups. In Proceedings of the 6th Workshop on Geographic Information Retrieval, 2010.
P. Serdyukov, V. Murdock, and R. Van Zwol. Placing Flickr Photos on a Map. In SIGIR09, New York, NY, USA. ACM.
C. Hauff and G. Houben. Geo-location Estimation of Flickr Images: Social Web Based Enrichment. ECIR 2012. Springer LNCS 7224, April.
N. O'Hare and V. Murdock. Modeling Locations with Social Media. Information Retrieval, pp. 133, 2012.
O. Van Laere, S. Schockaert, and B. Dhoedt. Finding Locations of Flickr Resources using Language Models and Similarity Search. ICMR 11, New York, USA. ACM.

References (2/2) #32
D.J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the World's Photos. In Proceedings of the 18th International Conference on World Wide Web, WWW 09, New York, NY, USA. ACM.
M. Trevisiol, H. Jegou, J. Delhumeau, and S. Gravier. Retrieving Geo-Location of Videos with a Divide and Conquer Hierarchical Multimodal Approach. ICMR13, Dallas, United States, April. ACM.
P. Kelm, S. Schmiedeke, and T. Sikora. A Hierarchical, Multi-modal Approach for Placing Videos on the Map using Millions of Flickr Photographs. In Proceedings of the 2011 ACM Workshop on Social and Behavioural Networked Media Access, SBNMA 11, pages 1520, New York, NY, USA. ACM.

Thank you! Resources: Slides: Code: Benchmark: Get in / George Kordopatis / #33