Emerging Topic Detection on Twitter (Cataldi et al., MDMKDD 2010) Padmini Srinivasan Computer Science Department Department of Management Sciences

Slides:



Advertisements
Similar presentations
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Advertisements

WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.
Towards Twitter Context Summarization with User Influence Models Yi Chang et al. WSDM 2013 Hyewon Lim 21 June 2013.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Hashtags as Milestones in Time Identifying the hashtags for meaningful events using Twitter search logs and Wikipedia data Stewart Whiting University of.
@ Carnegie Mellon Databases User-Centric Web Crawling Sandeep Pandey & Christopher Olston Carnegie Mellon University.
Kira Radinsky, Sagie Davidovich, Shaul Markovitch Computer Science Department Technion – Israel Institute of technology.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
Web Search – Summer Term 2006 III. Web Search - Introduction (Cont.) - Jeff Dean, Google's Systems Lab:
A Method for Focused Crawling Using Combination of Link Structure and Content Similarity SeyedMohsen (Mohsen) Jamali
Problem Addressed Attempts to prove that Web Crawl is random & biased image of Web Graph and does not assert properties of Web Graph Understanding the.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Learning Bit by Bit Search. Information Retrieval Census Memex Sea of Documents Find those related to “new media” Brute force.
Personalized Ontologies for Web Search and Caching Susan Gauch Information and Telecommunications Technology Center Electrical Engineering and Computer.
Overview of Web Data Mining and Applications Part I
What’s The Difference??  Subject Directory  Search Engine  Deep Web Search.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Personalization in Local Search Personalization of Content Ranking in the Context of Local Search Philip O’Brien, Xiao Luo, Tony Abou-Assaleh, Weizheng.
On Sparsity and Drift for Effective Real- time Filtering in Microblogs Date : 2014/05/13 Source : CIKM’13 Advisor : Prof. Jia-Ling, Koh Speaker : Yi-Hsuan.
Citation Recommendation 1 Web Technology Laboratory Ferdowsi University of Mashhad.
Search Engine optimization.  Search engine optimization (SEO) is the process of affecting the visibility of a website or a web page in a search engine's.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.
Selecting a Topic and Purpose
Crawlers Padmini Srinivasan Computer Science Department Department of Management Sciences
Searching the Web Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0Attribution-NonCommercial.
Redeeming Relevance for Subject Search in Citation Indexes Shannon Bradshaw The University of Iowa
Using Hyperlink structure information for web search.
The Search Engine Landscape: 2010 How Users Interact with Engines & How the Search Engines Crawl, Index & Rank Pages Rand Fishkin CEO & Co-Founder: SEOmoz.
Homework 4 Final homework Deadline: Sunday April 20, PM In this homework you have to write a short essay on how Google can handle new types of data.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
SIRS Issues Researcher Insight into today’s Leading Issues sks.sirs.com | proquestk12.com.
Fall 2006 Davison/LinCSE 197/BIS 197: Search Engine Strategies 2-1 How Search Engines Work Today we show how a search engine works  What happens when.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
SEO  What is it?  Seo is a collection of techniques targeted towards increasing the presence of a website on a search engine.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Social Networking Algorithms related sections to read in Networked Life: 2.1,
Glasgow 02/02/04 NN k networks for content-based image retrieval Daniel Heesch.
MULTIMEDIA DEFINITION OF MULTIMEDIA
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Gregor Gisler-Merz How to hit in google The anatomy of a modern web search engine.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Social Media 101 An Overview of Social Media Basics.
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
Publication Spider Wang Xuan 07/14/2006. What is publication spider Gathering publication pages Using focused crawling With the help of Search Engine.
Ch 14. Link Analysis Padmini Srinivasan Computer Science Department
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
SEO Who knew 3 letters could mean so much?. What is SEO? Search Engine Optimization (SEO) is the practice of improving and promoting a web site in order.
Algorithmic Detection of Semantic Similarity WWW 2005.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
Augmenting Focused Crawling using Search Engine Queries Wang Xuan 10th Nov 2006.
Search Engine-Crawler Symbiosis: Adapting to Community Interests
“In the beginning -- before Google -- a darkness was upon the land.” Joel Achenbach Washington Post.
1 CS 430: Information Discovery Lecture 18 Web Search Engines: Google.
Digital Images / Write Copy CUFIMA01A Produce And Manipulate Digital Images CUFWRT05A Write Content And/Or Copy Week 4.
Evaluating Event Credibility on Twitter Presented by Yanan Xie College of Computer Science, Zhejiang University 2012.
Applying Link-based Classification to Label Blogs Smriti Bhagat, Irina Rozenbaum Graham Cormode.
Chapter 8: Web Analytics, Web Mining, and Social Analytics
General Architecture of Retrieval Systems 1Adrienn Skrop.
CS 440 Database Management Systems Web Data Management 1.
Crawling When the Google visit your website for the purpose of tracking, Google does this with help of machine, known as web crawler, spider, Google bot,
DATA MINING Introductory and Advanced Topics Part III – Web Mining
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Multimedia Information Retrieval
Presentation transcript:

Emerging Topic Detection on Twitter (Cataldi et al., MDMKDD 2010) Padmini Srinivasan Computer Science Department Department of Management Sciences

Twitter 2 to 3 new users/second!! 175 million users as of today 95 million tweets written each day 300 employees and hiring! “what are you doing” “what’s happening” Daily chatter, URL sharing, news – Distributed/local reporting; no filters; no editing – Fastest, lowest level information service Not a social network but an information network

Aggregators & others Plenty: – Tweetmeme Separates the kinds of media linked to (video, image..) – Twitter search – Twistori Aggregates emotions – Tweetsentiments.com – Retail Twitter Aggregation – TweetTabs – Where, what, when

Finding ‘Emergent’ Topics What is it? – A topic popular in current time and not in the past 5 steps: – Represent tweets as vectors of terms (language independent) – Graph active authors’ social relationships (PageRank) – Model each term’s life cycle: “novel aging theory” – Rank terms based on their “energy”, select top few – Create a navigable topic graph linking emerging terms with co-occurring ones: Emerging Topics.

Cool points Nice overview with enough details about text representation, processing etc. Hypothesis: flow of ideas from geographical origin of event to outside – So find the starting tweets and you find the locale – Always so? Global event? Disasters? See PageRank in action – Author network in particular SCC – Term networks Biological metaphor – Lots of different terms: nutrition, calories, energy… Reasonable case study Language – humour – heartquake – Not dumping factor

Some trends

Last Class Continuation Crawler Evaluation What are good pages? Web scale is daunting User based crawls are short, but web agents? Page importance assessed – Presence of query keywords – Similarity of page to query/description – Similarity to seed pages (held out sample) – Use a classifier – not the same as used in crawler – Link-based popularity (but within topic?)

Summarizing Performance Precision – Relevance is Boolean: yes/no Harvest rate: # of good pages/total # pages – Relevance is continuous Average relevance over crawled set – Recall Target recall: held out seed pages (H) – |H ∧ pages crawled|/|pages crawled| Robustness – Start same crawler on disjoint seed sets. Examine overlap of fetched pages

Sample Performance Graph

Summary Crawler architecture Crawler algorithms Crawler evaluation Assignment 1 – Run two crawlers for 5000 pages. – Start with the same set of seed pages for a topic. – Look at overlap and report this over time (robustness)