Yi-Chia Wang, LTI 2nd-year Master's student


Trend Analysis
Analysis of Social Media
Yi-Chia Wang, LTI 2nd-year Master's student

Introduction
Document streams arrive continuously over time: e-mail, news articles, search engine query logs, ...
Identifying topics in document streams: topic detection and tracking, text mining, visualization, ...
Is there a better organizing principle for the enormous archives of document streams?
Temporal information in document streams (Trausan-Matu et al., 2007).
Oct-30, Analysis of Social Media 2007

"Burst of activity"
Topics appear, grow in intensity for a period of time, and then fade away.
Bursts correspond to points at which the intensity of message arrivals increases sharply.
Problems with naive identification of bursts: it easily identifies large numbers of short bursts, and it fragments a long burst into many smaller ones.
Goal: identify bursts only when they have sufficient intensity.

Bursty and Hierarchical Structure in Streams
Jon Kleinberg, Department of Computer Science, Cornell University. SIGKDD '02.
My advisor is Carolyn Rose. This work is funded by PSLC. The project name is TagHelper.

Two-state Automaton (A) Model
Idea: periods of lower message intensity interleave with periods of higher message intensity.
A begins in state q0.
A changes state with probability p and stays in its current state with probability 1-p.
When in state q0, messages are emitted at a slow rate; when in state q1, messages are emitted at a faster rate.
[Figure: message intensity over time, alternating between states q0 and q1, with transition probabilities p and 1-p.]
States correspond to higher and higher message intensities.
State transitions signal bursts.
Emissions decide the arrival times of the next messages.
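The generative behavior described on this slide can be sketched in a few lines. This is only a minimal illustration: the function name `simulate_two_state` and the parameters `slow_rate`, `fast_rate`, and `seed` are hypothetical, and the choice of exponentially distributed gaps is taken from the model as described in the talk.

```python
import random


def simulate_two_state(n, p, slow_rate, fast_rate, seed=0):
    """Sample n (state, gap) pairs from a two-state automaton.

    Assumptions (illustrative, not from the slides): the automaton starts
    in q0, may switch state with probability p before each message, and
    the gap to the next message is exponential with the state's rate.
    """
    rng = random.Random(seed)
    state = 0  # A begins in state q0
    states, gaps = [], []
    for _ in range(n):
        if rng.random() < p:  # change state with probability p
            state = 1 - state
        rate = fast_rate if state == 1 else slow_rate
        gaps.append(rng.expovariate(rate))  # time until the next message
        states.append(state)
    return states, gaps


if __name__ == "__main__":
    states, gaps = simulate_two_state(20, p=0.1, slow_rate=0.1, fast_rate=1.0)
    for st, gap in zip(states, gaps):
        print(f"q{st}  gap={gap:.2f}")
```

Runs of q1 in the output correspond to the high-intensity periods: their gaps are short on average, so messages arrive close together.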

Exponential Distribution
Modeling the message emission rate amounts to modeling the time gap between consecutive messages.
The gap is modeled by an exponential distribution whose parameter is the rate of message arrivals.
(Figure source: Wikipedia.)
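For reference, the standard exponential density can be written out as below. The symbols are assumptions following the notation of the cited Kleinberg paper rather than anything shown in the transcript: x is the gap, alpha is the arrival rate, and the two states use rates alpha_0 < alpha_1.

```latex
f(x) = \alpha e^{-\alpha x}, \quad x > 0,
\qquad \mathbb{E}[x] = \frac{1}{\alpha}

f_0(x) = \alpha_0 e^{-\alpha_0 x}, \qquad
f_1(x) = \alpha_1 e^{-\alpha_1 x}, \qquad \alpha_1 > \alpha_0
```

A larger rate gives a smaller expected gap, which is why the faster state q1 is associated with the larger parameter.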

Two-state Automaton (A) Model
Formally, we are given messages with specified arrival times, which determine the inter-arrival gaps between consecutive messages.
We want to determine the conditional probability of a state sequence given those gaps.
Given a set of messages, one can then find a likely state sequence based on the model.
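One way to write this conditional probability is through Bayes' rule. The notation here is assumed from the cited Kleinberg paper: x = (x_1, ..., x_n) are the gaps, q = (q_{i_1}, ..., q_{i_n}) is a state sequence, f_{i} is the gap density of state q_i, p is the state-change probability, and b is the number of state transitions in q (counting from the initial state q0).

```latex
\Pr[\mathbf{q} \mid \mathbf{x}]
  = \frac{\Pr[\mathbf{q}]\, f_{\mathbf{q}}(\mathbf{x})}
         {\sum_{\mathbf{q}'} \Pr[\mathbf{q}']\, f_{\mathbf{q}'}(\mathbf{x})},
\qquad
f_{\mathbf{q}}(\mathbf{x}) = \prod_{t=1}^{n} f_{i_t}(x_t),
\qquad
\Pr[\mathbf{q}] = p^{\,b} (1-p)^{\,n-b}
```

The denominator does not depend on q, so comparing state sequences only requires the numerator.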

Two-state Automaton (A) Model
The task is to find a state sequence q that maximizes this probability.
Equivalently, we minimize a cost function that balances two preferences:
it favors state sequences that conform well to the sequence x of gap values, and
it favors sequences with a small number of state transitions.
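A cost function with exactly these two preferences appears in the cited Kleinberg paper; it is reproduced here as a hedged reconstruction, with all symbols assumed from that paper. Here x_t is the t-th gap, i_t is the index of the state used for that gap, f_i is the exponential gap density of state q_i, p is the state-change probability, and b is the number of state transitions in q.

```latex
c(\mathbf{q} \mid \mathbf{x})
  = b \ln\!\left(\frac{1-p}{p}\right)
  + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)
```

The first term charges a fixed price per transition (positive when p < 1/2), and the second term is small when each gap is likely under its assigned state. The cost equals the negative log of Pr[q] f_q(x) up to the additive constant -n ln(1-p), which is the same for every q.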

Infinite-state Automata Model: Cost Function
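For orientation, the infinite-state cost in the cited Kleinberg paper has the following shape. This is a reconstruction under assumed notation: n gaps span total time T, state q_i has rate alpha_i, s > 1 scales the rates, gamma > 0 scales transition costs, and the automaton starts in q_0 (so i_0 = 0).

```latex
\alpha_i = \frac{n}{T}\, s^{\,i}, \qquad
\tau(i, j) =
  \begin{cases}
    (j - i)\,\gamma \ln n & \text{if } j > i, \\
    0 & \text{otherwise},
  \end{cases}

c(\mathbf{q} \mid \mathbf{x})
  = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1})
  + \sum_{t=1}^{n} -\ln f_{i_t}(x_t),
\qquad f_i(x) = \alpha_i e^{-\alpha_i x}
```

Moving to a more intense state costs an amount proportional to the number of levels climbed, while dropping back to a less intense state is free.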

Computing a minimum-cost state sequence
THEOREM: If q* is an optimal state sequence in the infinite-state automaton, then it is also an optimal state sequence in a finite-state version of the automaton with sufficiently many states.
Dynamic programming is used to search for an optimal state sequence.
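The dynamic program can be sketched as a Viterbi-style search over a finite number of states. This is a minimal sketch, not the author's implementation: the function name `min_cost_state_sequence` and the default values of `k`, `s`, and `gamma` are hypothetical, and the rate and transition-cost formulas are assumed from the cited Kleinberg paper.

```python
import math


def min_cost_state_sequence(gaps, k=4, s=2.0, gamma=1.0):
    """Return a minimum-cost state index for every gap (Viterbi-style DP).

    Assumptions (from the cited paper, not the slides): state i emits
    exponential gaps with rate (n / T) * s**i, and moving up from state i
    to state j costs (j - i) * gamma * ln(n); moving down is free.
    """
    n = len(gaps)
    base_rate = n / sum(gaps)  # average arrival rate, used for state 0
    rates = [base_rate * s ** i for i in range(k)]

    def emit_cost(i, x):
        # -ln of the exponential density rates[i] * exp(-rates[i] * x)
        return rates[i] * x - math.log(rates[i])

    def move_cost(i, j):
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    # best[j]: cheapest cost of explaining the gaps so far, ending in state j
    best = [0.0] + [math.inf] * (k - 1)  # the automaton starts in state 0
    back = []  # back[t][j]: best predecessor of state j at step t
    for x in gaps:
        new_best, pointers = [], []
        for j in range(k):
            prev = min(range(k), key=lambda i: best[i] + move_cost(i, j))
            new_best.append(best[prev] + move_cost(prev, j) + emit_cost(j, x))
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace the optimal path backwards from the cheapest final state.
    state = min(range(k), key=lambda j: best[j])
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states


if __name__ == "__main__":
    gaps = [10.0] * 5 + [0.5] * 20 + [10.0] * 5
    print(min_cost_state_sequence(gaps))
```

The running time is proportional to n * k^2, which is why restricting the search to a finite number of states matters in practice.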

Bursts exhibit a natural nested structure
A burst of intensity j is a maximal interval over which the state sequence stays in a state of index j or higher.
Bursts can also be represented as a tree: each burst is a node in the tree.
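The definition of a burst of intensity j translates directly into a scan over a state sequence. The sketch below assumes the states are given as a list of integer indices, one per gap; the function name `extract_bursts` and the (level, start, end) tuple layout are illustrative choices.

```python
def extract_bursts(states):
    """Return (level, start, end) for every maximal run with state >= level.

    Indices are positions in `states`; `end` is inclusive.
    """
    bursts = []
    top = max(states, default=0)
    for level in range(1, top + 1):
        start = None
        for t, st in enumerate(states):
            if st >= level and start is None:
                start = t  # a burst of this intensity opens here
            elif st < level and start is not None:
                bursts.append((level, start, t - 1))  # burst closes
                start = None
        if start is not None:  # burst still open at the end of the stream
            bursts.append((level, start, len(states) - 1))
    return bursts


if __name__ == "__main__":
    print(extract_bursts([0, 1, 1, 2, 2, 1, 0, 1, 0]))
    # prints [(1, 1, 5), (1, 7, 7), (2, 3, 4)]
```

In the example, the intensity-2 burst over positions 3 to 4 lies inside the intensity-1 burst over positions 1 to 5, so it would be that burst's child in the tree representation.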

Experiments
The model makes sense for many datasets (of an analogous flavor): email, titles of conference papers, U.S. Presidential State of the Union Addresses, and Web clickstreams.

Email Dataset
Does the appearance of messages containing particular words exhibit a burst in the vicinity of significant times such as deadlines?
Data: the author's own collection of email, June 9, 1997 to August 23, 2001, containing 34344 messages (41.7 MB).
The analysis focuses on the response set.

Results for the Word "ITR"
ITR is the name of a large NSF program. The author wrote 2 proposals for it in 1999-2000: one small and one large.
The figure shows the resulting bursts for the optimal state sequence. The intervals are annotated with the first and last dates of the messages.
The first subtree splits further into 2 subtrees.
For the 2nd subtree, there is no burst since the author did not continue the submission.

Results for the Word "prelim"
"Prelim" is the term used at Cornell for non-final exams.
The author taught courses in 4 of the 8 semesters covered by the email collection, and each of these courses had 2 prelims.
For the first of these courses, there was a special course email account.
Each of the remaining 3 courses corresponds to a long burst and 2 shorter, more intense bursts for the particular prelims.
The 2 structures suggest how a large folder of email might naturally be divided into a hierarchical set of sub-folders around certain key events, based only on the rate of message arrivals.

Titles of Conference Papers
Goal: extract bursts in term usage from the titles of conference papers over the past several decades.
Problem: conference papers arrive in discrete batches every half year or year, so there are no message inter-arrival gaps.
Modified automaton model: the automaton generates batched arrivals. For each state, there is an expected fraction of relevant documents. A burst is identified when the fraction of relevant documents increases.

Titles of Conference Papers
A cost function is defined for each arrival batch.
The weight of a burst is the improvement in cost obtained by using state q1 rather than state q0 over the burst's interval.
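A minimal sketch of the batch cost and burst weight follows. It assumes the binomial form used in the cited Kleinberg paper: a batch of d documents, r of them relevant, costs the negative log of the binomial probability under the state's expected fraction. The function names, the example batches, and the choice p1 = 2 * p0 are hypothetical.

```python
import math


def batch_cost(p, r, d):
    """-ln of the binomial probability of r relevant documents out of d,
    when a fraction p of documents is expected to be relevant (0 < p < 1)."""
    log_prob = (math.log(math.comb(d, r))
                + r * math.log(p)
                + (d - r) * math.log(1 - p))
    return -log_prob


def burst_weight(batches, p0, p1, t1, t2):
    """Improvement in cost from using state q1 instead of q0 on batches
    t1..t2 (inclusive). `batches` is a list of (r, d) pairs."""
    return sum(batch_cost(p0, r, d) - batch_cost(p1, r, d)
               for r, d in batches[t1:t2 + 1])


if __name__ == "__main__":
    # (relevant titles, total titles) per conference year; made-up numbers
    batches = [(1, 100), (2, 100), (10, 100), (12, 100), (1, 100)]
    p0 = sum(r for r, _ in batches) / sum(d for _, d in batches)
    p1 = 2 * p0  # assumed: the bursty state expects twice the base fraction
    print(burst_weight(batches, p0, p1, 2, 3))  # positive: q1 fits better
    print(burst_weight(batches, p0, p1, 0, 1))  # negative: q0 fits better
```

A positive weight means the high-fraction state explains those batches better than the base state, so larger weights mark stronger bursts.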

SIGMOD & VLDB, 1975-2001
Each word in the paper titles is considered.
The figure lists the 30 bursts of highest weight.
For bursts with no ending date, the interval extends to the most recent conference.
These bursty words are different from a list of common words.
The bursts are picking up trends in language use.

STOC & FOCS, 1969-2001
The figure lists the 30 bursts of highest weight.
Some bursts reflect particular titling conventions that were in fashion for certain periods, for example "How to construct random functions" ...

U.S. Presidential State of the Union Addresses
(Kleinberg, SIGKDD '02)

Web usage data: clickstreams
Settings: 80 undergraduate students, observed for two and a half months in Spring 2000.
For every URL w, all bursts in the stream of visits to w are determined.
The analysis focuses on high-weight bursts that also involve at least 10 distinct users.
Results: high-ranked bursts involve the URLs of the online class reading assignments, centered on intervals shortly before and during the weekly sessions at which they were discussed.

Conclusions
Streams are modeled using an infinite-state automaton.
State transitions lead to bursts.
First story detection: a single message on which the associated state transition occurred.
The model offers a means of structuring the information from our patterns of interacting and communicating.
Document streams have a strong temporal character.
In many domains, we are accumulating detailed records of our own communication and behavior.