BIBLIOGRAPHY ON EVENTS DETECTION Kleisarchaki Sofia.


Contents
1. Events, Topics, Entities and Dynamics
  Event Detection
  Topic & Entity Extraction
  Dynamics in Perception
  Multimedia Topic & Entity Extraction
2. Opinion Mining
  Opinion Mining from Text
  Opinion Mining from Multimedia Objects
3. Intelligent Content Acquisition Support
  Crawling the hidden Web
  Focused and topical crawling
  Information extraction in semi-structured Web pages
4. Social Web Mining and Massive Collaboration
  Analyzing social network structure
  Finding high-quality items and influential people in social media
  Searching within a context
  Massive Collaboration

Event Detection  TDT and NED New Event Detection (NED): task for detecting stories about previously unseen events in a stream of news stories. NED is one of the tasks in the Topic Detection and Tracking (TDT) program. TDT: The TDT program seeks to develop technologies that search, organize and structure multilingual news-oriented textual materials from a variety of broadcast news media. TF-IDF is the prevailing technique for document representation and term weighting. TDT NED

Event Detection  NED Example – Sinking of an oil tanker The first story on the topic would be the article that first reports the sinking of the tanker itself. Other stories on the same topic would be those discussing the environmental damage, the salvaging efforts, the commercial impact and so on. A good NED system would be one that correctly identifies the article that first reports the sinking as the first story.

Event Detection - Common Approach
 On-line systems compute the similarity between the incoming document and the known events.
 They apply a threshold to decide whether the incoming document is the first story of a new event or a story about some known event.
 [Brants & Chen, 2003]: “A System for New Event Detection”

Modifications to Common Approach
1. Better representation of contents
 - New distance metrics (e.g., Hellinger)
 - Classifying documents into different categories
 - Usage of named entities
 - Re-weighting of terms
2. Utilizing time information
 - Usage of the chronological order of documents
 - Usage of decaying functions to modify the similarity metrics of the contents

Event Detection [Brants & Chen, 2003]: “A System for New Event Detection”
 Presents a new method and system for performing the NED task in one or multiple streams of news stories. All stories on a previously unseen (new) event are marked. Based on an incremental TF-IDF model.

Incremental TF-IDF Model
 Pre-processing step
 The document frequencies df(w) are not static but change over time steps t:
 df_t(w) = df_{t-1}(w) + df_{Ct}(w)  (1), where df_{Ct}(w) denotes the document frequencies in the newly added set of documents Ct.
 The initial document frequencies df_0(w) are generated from a (possibly empty) training set.
 Low-frequency terms w tend to be uninformative. Use only terms with df_t(w) >= θ_d.
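The incremental update in equation (1) is easy to sketch. The following is a minimal illustration (not the authors' code): document frequencies accumulate over time steps, and the θ_d cut-off drops low-frequency terms.

```python
from collections import Counter

def update_df(df, new_docs):
    """Equation (1): df_t(w) = df_{t-1}(w) + df_{Ct}(w).
    df is the running Counter of document frequencies; new_docs is the
    newly added set of documents Ct, each given as a list of terms."""
    for doc in new_docs:
        for term in set(doc):  # count each term at most once per document
            df[term] += 1
    return df

def vocabulary(df, theta_d):
    """Keep only terms with df_t(w) >= theta_d; low-frequency terms
    tend to be uninformative and are dropped."""
    return {w for w, n in df.items() if n >= theta_d}
```

For example, after two documents about an oil-tanker story, `update_df` gives "oil" a document frequency of 2, and `vocabulary(df, 2)` keeps only that term.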

Term Weighting
 The document frequencies df_t(w) are used to calculate normalized weights weight_t(d, w) for the terms w in the documents d.

Similarity Calculations
 The vectors of normalized term weights weight_t are used to calculate the similarity between two documents d and q, for example using the Hellinger distance.
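The weighting and similarity formulas on these slides were images that did not survive extraction. The sketch below assumes the standard per-document-normalized TF-IDF weighting and a Hellinger/Bhattacharyya-style affinity sim(q, d) = Σ_w sqrt(weight(q, w) · weight(d, w)); this matches the surrounding description but may not be the exact variant the paper uses.

```python
import math

def tf_idf_weights(doc_terms, df, n_docs):
    """Assumed weighting: weight_t(d, w) proportional to
    tf(d, w) * log(N_t / df_t(w)), normalized to sum to 1 per document."""
    raw = {}
    for w in doc_terms:
        if df.get(w, 0) > 0:  # unseen terms carry no df-based weight
            raw[w] = raw.get(w, 0.0) + math.log(n_docs / df[w])
    z = sum(raw.values()) or 1.0
    return {w: v / z for w, v in raw.items()}

def hellinger_sim(wq, wd):
    """Hellinger/Bhattacharyya affinity between two weight distributions:
    identical distributions score 1, disjoint vocabularies score 0."""
    return sum(math.sqrt(wq[w] * wd[w]) for w in wq.keys() & wd.keys())
```

With this choice, a document compared to itself scores 1.0 and documents sharing no vocabulary score 0.0, which is the behavior the decision rule on the next slide relies on.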

Making a Decision
 To decide whether a new document q added to the collection at time t describes a new event, it is individually compared to all previous documents d. We identify the document d* with the highest similarity to q:
 d* = argmax_d sim_t(q, d)
 This value is used to determine whether document q is about a new event:
 score(q) = 1 − sim_t(q, d*)
 if score(q) >= θ_s then YES else NO
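The decision rule can be written down directly; this is a minimal sketch in which `sim` is a placeholder for whatever document similarity is in use, not the paper's exact metric.

```python
def ned_decision(q, past_docs, theta_s, sim):
    """First-story decision, as on the slide:
    d* = argmax_d sim(q, d), score(q) = 1 - sim(q, d*),
    and q is declared a new event iff score(q) >= theta_s."""
    best = max((sim(q, d) for d in past_docs), default=0.0)
    score = 1.0 - best
    return score >= theta_s, score
```

For instance, with a simple set-overlap similarity, a document identical to a past story gets score 0 (old event) while a document sharing no terms gets score 1 (new event).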

Improvements
 Documents in the stream of news stories may stem from different sources, and each source might have somewhat different vocabulary usage.
 Keep source-specific document frequencies df_{s,t}(w) for source s at time t.
 The frequencies are updated according to equation (1), but using only those documents in Ct that are from the same source s.

Document Similarity Normalization
 A high similarity of a broad-topic document to some other document generally does not mean the same as a high similarity of a narrow-topic document to some other document. Similarities are therefore normalized using the average similarity of the current document q to all previous documents in the collection.
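The normalization formula on this slide was an image that did not survive extraction. Dividing each raw similarity by the average similarity of q to all previous documents is one plausible reading of the idea, sketched here; the exact formula may differ.

```python
def normalized_sims(q, past_docs, sim):
    """Rescale raw similarities by the average similarity of the current
    document q to all previous documents, so broad-topic documents
    (which score high against everything) are not unfairly treated.
    NOTE: dividing by the mean is an assumption, not the slide's
    recoverable formula."""
    sims = [sim(q, d) for d in past_docs]
    avg = sum(sims) / len(sims) if sims else 0.0
    return [s / avg for s in sims] if avg > 0 else sims
```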

Source-Pair Specific On-Topic Similarity Normalization
 Documents that stem from the same source and describe the same event tend to have a higher similarity than documents that stem from different sources and also describe the same event, because of vocabulary conventions the sources adhere to.
 E_{s(q),s(d)}: the average similarity of stories on the same event from the particular source pair that q and d are drawn from; s(q) and s(d) denote the sources of q and d.

Using Inverse Event Frequencies of Terms
 ROI (Rules of Interpretation): a higher-level categorization of the events.
 Terms (in the same ROI) that are highly informative about an event (e.g., Cardoso, the name of the former Brazilian president) should receive higher weights than others (e.g., Election). Term weights are scaled accordingly, where ef(r, w) is the number of events that belong to ROI r and that contain term w.

Matching Parts of Documents
 Two documents may only partially overlap, even though they are on the same event.
 We calculate the similarity score of each segment in one document against each segment in the other document, where s_1 and s_2 range over the segments of q and d.
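Taking the best-matching segment pair is a natural reading of this slide; the sketch below is illustrative, with `sim` standing in for the document-level similarity defined earlier.

```python
def segment_sim(q_segments, d_segments, sim):
    """Compare every segment s1 of q against every segment s2 of d and
    keep the best-scoring pair, so two documents that overlap in only
    one passage about the event can still match strongly."""
    return max(sim(s1, s2) for s1 in q_segments for s2 in d_segments)
```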

Experiments  Data Sets  TDT3 (training set: TDT2)  TDT4 (training set: TDT2, TDT3)  Evaluation Metric  Results  The best system has a topic-weighted minimum normalized cost of

Things That Did Not Help
1. Look-ahead (deferral period of 1, 10, or 100 files)
 Best results were obtained for deferral period = 1.
 Low df(w) implies high idf(w); the lower weight of new terms hurts performance, since new words are usually a good indicator of new events.
2. Using time information
 The model uses a window on the history of size m.
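The windowing formula itself is missing from the transcript. One simple interpretation of a "window on history of size m" is comparing a new story only against the m most recent documents, sketched below; this is illustrative, not the paper's exact model.

```python
from collections import deque

class HistoryWindow:
    """A window on history of size m: a new document is compared only to
    the m most recently seen documents; older stories fall out of scope."""
    def __init__(self, m):
        self.docs = deque(maxlen=m)

    def compare(self, q, sim):
        # Highest similarity of q to any document still inside the window.
        return max((sim(q, d) for d in self.docs), default=0.0)

    def add(self, d):
        self.docs.append(d)
```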

Event Detection [Kumaran & Allan, 2004]: “Text Classification and Named Entities for New Event Detection”
 False alarms are caused when an old story is assigned a high score; misses, which are more costly than false alarms, are caused when a new story is assigned a low score.
 An in-depth look at misses revealed that it was important to isolate the named entities and treat them preferentially.

Event Detection  To understand the utility of named entities we present two examples. 1. Stories about different events can lead to high IDF, cause to common words. This can be avoided if, for example, we give greater attention to the location named entities. 2. Stories about different topics can lead to high similarity, cause to common location named entity.  Named entities are a double-edged sword, and deciding when to use them can be tricky.

Event Detection  α, β, γ : three vector representations of each document.  a: All terms in document  β : Named entities (Event, GPE, Language, Location, Nationality, Organization, Person, Cardinal, Ordinal, Date, and Time)  γ : Non named entity terms  Named entities were identified using BBN Identifinder. We considered only the Event, GPE, Language, Location, Nationality, Organization, Person, Cardinal, Ordinal, Date, and Time named entities to create β.

Event Detection  On an average it is not named entities that matter more in finally detecting new Election stories, but the rest of the terms.

Event Detection  It is more useful to use the β score as an additional metric than the γ score.

Event Detection  Unfortunately, making such clear cut decisions for all categories is not possible.

2. Opinion Mining  Opinion mining concerns the automatic identification and extraction of opinions, emotions, and sentiments from:  Text Main activities: Analyzing product reviews, identifying opinionated documents, sentences and opinion holders.  Multimedia Objects Current research in this area has investigated two areas in particular. Firstly, there has been work in the area of automatic facial expression recognition. Secondly, there has been some work on associating low-level image features with emotions and sentiments.

2. Opinion Mining  Research in the field of opinion mining has typically focused on methods for detecting sentiment in a generalized way, such as the overall polarity (negative or positive) of user sentiment.  Typical approaches use supervised machine learning methods trained on human-annotated data, co-occurrence statistics, lexicons of positive and negative words and numeric ratings of product reviews (e.g. stars).

Opinion Mining from Text
“Opinion Observer: Analyzing and Comparing Opinions on the Web”
 Opinion Observer: an analysis system with a visual component to compare consumer opinions.

Technical Tasks 1. Identifying product features that customers have expressed their (positive or negative) opinions on. 2. For each feature, identifying whether the opinion from each reviewer is positive or negative.  Main Review Formats  Format (1) - Pros and Cons.  Format (2) - Pros, Cons and detailed review  Format (3) - free format

‘Algorithm’ Stages
 Stage 1: Extracting & analyzing customer reviews in 2 steps:
 Download reviews into a database (updated periodically)
 Analyze all new reviews of every product: identify product features, identify opinions
 Stage 2: Users can visualize and compare opinions of different products using a user interface.

Problem Statement
 P = {P1, P2, …, Pn}: a set of products.
 Each product Pi has a set of reviews Ri = {r1, r2, …, rk}.
 Each review rj is a sequence of sentences rj = ⟨s1, s2, …, sm⟩.
 Definition (product feature): A product feature f in rj is an attribute/component of the product that has been commented on in rj. If f appears in rj, it is called an explicit feature in rj. If f does not appear in rj but is implied, it is called an implicit feature in rj.
 “Battery life too short” (f = battery; explicit)
 “This camera is too large” (f = size; implicit)
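The problem statement maps naturally onto a small data model; the class and field names below are illustrative, not taken from the Opinion Observer system.

```python
from dataclasses import dataclass, field

@dataclass
class Review:
    """One review r_j: a sequence of sentences plus the product features
    found in it, kept separate by how they were expressed."""
    sentences: list
    explicit_features: set = field(default_factory=set)  # feature word appears in r_j
    implicit_features: set = field(default_factory=set)  # feature only implied in r_j

@dataclass
class Product:
    """A product P_i with its set of reviews R_i = {r_1, ..., r_k}."""
    name: str
    reviews: list = field(default_factory=list)
```

Under this model, "Battery life too short" contributes the explicit feature "battery", while "This camera is too large" contributes the implicit feature "size".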

System Architecture
 Review extraction: extracts all reviews from the given URLs and puts them in the database.
 Raw reviews: the original reviews extracted from the user-supplied sources on the Web.
 Processed reviews: reviews that have been processed by the automatic techniques and/or interactively tagged (corrected) by the analyst(s).
 Analyst: corrects any errors interactively using the UI.

4. Social Web Mining and Massive Collaboration  Analyzing social network structure  One key research topic is the search for regularities in the way social networks evolve over time.  Another current topic is community detection.

4. Social Web Mining and Massive Collaboration  Finding high-quality items and influential people in social media  The quality of user-generated content varies drastically from excellent to abuse and spam. The task of identifying high-quality content in sites based on user contributions - social media sites - becomes increasingly important. Influence propagation. Developing methodologies to assess the quality of content provided in user-generated sites. Identify leaders and followers on a social network.

4. Social Web Mining and Massive Collaboration
 Massive Collaboration
 The idea of "social minds" has gained popularity over the last five years under the concept of the "wisdom of crowds", which applies to social tasks in general.
 The power of crowds comes from a combination of opinion diversity and independence, plus a decentralized aggregation mechanism.