Topic Detection and Tracking. Presented by CHU Huei-Ming, 2004/03/17

2 Reference
Pattern Recognition in Speech and Language Processing, Chap. 12: Modeling Topics for Detection and Tracking
–James Allan
–University of Massachusetts Amherst
–Publisher: CRC Press, published 2003/02
UMass at TDT 2004
–Margaret Connell, Ao Feng, Giridhar Kumaran, Hema Raghavan, Chirag Shah, James Allan
–University of Massachusetts Amherst
–TDT 2004 workshop

3 Topic Detection and Tracking (1/6)
The goal of TDT research is to organize news stories by the events that they describe.
The TDT research program began in 1996 as a collaboration between Carnegie Mellon University, Dragon Systems, the University of Massachusetts, and DARPA.
To find out how well classic IR technologies addressed TDT, the participants created a small collection of news stories and identified some topics within them.

4 Topic Detection and Tracking (2/6)
Event
–Something that happens at some specific time and place, along with all necessary preconditions and unavoidable consequences
Topic
–Captures the larger set of happenings that are related to some triggering event
–By forcing the additional events to be directly related, the topic is prevented from spreading out to include too much news

5 Topic Detection and Tracking (3/6)
TDT Tasks
–Segmentation
Break an audio track into discrete stories, each on a single topic
–Cluster Detection (Detection)
Place all arriving news stories into groups based on their topics
If a story fits no existing group, the system must decide whether to create a new topic
Each story is placed in precisely one cluster
–Tracking
Starts with a small set of news stories that a user has identified as being on the same topic
The system must monitor the stream of arriving news to find all additional stories on the same topic

6 Topic Detection and Tracking (4/6)
–New Event Detection (first story detection)
Focuses on the cluster creation aspect of cluster detection
Evaluated on its ability to decide when a new topic (event) appears
–Link Detection
Determine whether or not two randomly presented stories discuss the same topic
A solution to this task could be used to solve new event detection

7 Topic Detection and Tracking (5/6)
Corpora
–TDT-2: in 2002 it is being augmented with some Arabic news from the same time period
–TDT-3: created for the 1999 evaluation; stories from four Arabic sources are being added during 2002

Stage         Source                                                                 Stories / Topics / Duration
Pilot study   CNN, Reuters                                                           16, ~12, 1995 1~6
TDT-2         six English sources and three Chinese sources                          80, ~6
TDT-3         eight English sources and three Chinese sources                        40, ~12
TDT-4         eight English sources, three Chinese sources and four Arabic sources   45, ~12,

8 Topic Detection and Tracking (6/6)
Evaluation
–P(target) is the prior probability that a story will be on topic
–C_miss and C_fa are user-specified values that reflect the cost associated with each type of error
–P(miss) and P(fa) are the actual system error rates
–Within TDT evaluations, C_miss = 10, C_fa = 1
–P(target) = 1 - P(off-target) = 0.02 (derived from training data)
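
The cost formula itself did not survive the transcription; the standard TDT detection cost, reconstructed here from the quantities the slide defines, combines the two error types as

C_det = C_miss * P(miss) * P(target) + C_fa * P(fa) * (1 - P(target))

and is usually reported in normalized form,

(C_det)_norm = C_det / min{ C_miss * P(target), C_fa * (1 - P(target)) }

so with C_miss = 10, C_fa = 1 and P(target) = 0.02, the normalizing constant is min{0.2, 0.98} = 0.2.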

9 Basic Topic Model
Vector Space
–Represent items (stories or topics) as vectors in a high-dimensional space
–The most common comparison function is the cosine of the angle between the two vectors
Language Models
–A topic is represented as a probability distribution over words
–The initial probability estimates come from the maximum likelihood estimate based on the document
Use of a topic model
–See how likely the particular story could be generated by the model
–Compare two models directly: a symmetric version of Kullback-Leibler divergence
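
As a minimal sketch of the vector-space comparison described above (an illustration, not the exact term weighting used by any TDT system), the cosine between two bag-of-words vectors can be computed as:

```python
import math
from collections import Counter

def cosine(terms_a, terms_b):
    """Cosine of the angle between two bag-of-words vectors."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va if t in vb)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(cosine("a quake struck the city".split(),
             "the quake damaged the old city".split()))
```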

10 Implementing the Models (1/3)
Named Entities
–News is usually about people, so it seems reasonable that their names could be treated specially
–Treat the named entities as a separate part of the model and then merge the parts
–Boost the weight of any words in the stories that come from names, giving them a larger contribution to the similarity when the names are in common
–These approaches improve the results slightly, but no strong gains so far

11 Implementing the Models (2/3)
Document Expansion
–In the segmentation task, a possible segmentation boundary can be checked by comparing the models generated by the text on either side
–The text can be used as a query to retrieve a few dozen related stories, and then the most frequently occurring words from those stories can be used for the comparison
–Relevance models result in substantial improvements in the link detection task
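
A minimal sketch of the expansion step, assuming a hypothetical retrieve(query_terms, k) function that stands in for whatever retrieval engine is available: the story's own terms act as a query, and the most frequent terms in the retrieved stories form the expanded representation.

```python
from collections import Counter

def expand_document(story_terms, retrieve, k=30, n_terms=50):
    """Pseudo-relevance expansion: retrieve related stories and keep
    their most frequent terms as an expanded representation."""
    related = retrieve(story_terms, k)   # retrieve() is a hypothetical stand-in
    counts = Counter(t for doc in related for t in doc)
    return [t for t, _ in counts.most_common(n_terms)]
```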

12 Implementing the Models (3/3)
Time Decay
–The likelihood that two stories discuss the same topic diminishes as the stories are further separated in time
–In a vector space model, the cosine similarity function can be changed so that it includes a time decay
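
One plausible form of the time-decayed comparison (an assumption for illustration; the chapter does not fix the exact function here) multiplies the cosine similarity by an exponential penalty on the time gap between the two stories:

```python
def decayed_similarity(cos_sim, days_a, days_b, half_life_days=30.0):
    """Down-weight a cosine score as two stories drift apart in time."""
    gap = abs(days_a - days_b)                        # time gap in days
    return cos_sim * 0.5 ** (gap / half_life_days)    # halves every half_life_days
```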

13 Comparing Models (1/3)
Nearest Neighbors
–In the vector space model, a topic might be represented as a single vector
–To determine whether or not a story is on any of the existing topics, we consider the distance between the story's vector and the closest topic vector
–If it falls outside the specified distance, the story is likely to be the seed of a new topic and a new vector can be formed

14 Comparing Models (2/3)
Decision Trees
–The best place for decision trees within TDT may be the segmentation task
–There are numerous training instances (hand-segmented stories)
–Finding features that are indicative of a story boundary is possible and achieves good quality

15 Comparing Models (3/3)
Model-to-Model
–Direct comparison of the statistical language models that represent topics
–Kullback-Leibler divergence
–To make the asymmetric measure symmetric, calculate it both ways and add the results together
–One approach penalizes the comparison if the models are too much like background news
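
A small sketch of the symmetric comparison (smoothing is added so the divergence stays finite; the exact smoothing used by TDT systems varies):

```python
import math

def symmetric_kl(p, q, vocab, eps=1e-6):
    """Symmetric Kullback-Leibler divergence between two unigram topic models,
    each given as a dict mapping word -> probability."""
    def kl(a, b):
        return sum((a.get(t, 0.0) + eps) *
                   math.log((a.get(t, 0.0) + eps) / (b.get(t, 0.0) + eps))
                   for t in vocab)
    return kl(p, q) + kl(q, p)
```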

16 Miscellaneous Issues (1/3)
Deferral
–All of the tasks are envisioned as "on-line" tasks
–The decision about a story is expected before the next story is presented
–In fact, TDT provides a moderate amount of look-ahead for the tasks
–First, stories are always presented to the system grouped into "files" that correspond to about a half hour of news
–Second, the formal TDT evaluation incorporates a notion of deferral that allows a system to explore the advantage of deferring decisions until several files have passed

17 Miscellaneous Issues (2/3)
Multi-modal Issues
–The stories TDT systems must deal with are either written text (newswire) or read text (audio)
–Speech recognizers make numerous mistakes, inserting, deleting, and even completely transforming words into other words
–A key difference between the two modes is the need for score normalization
–For pairs of stories drawn from different sources the score distributions differ; to make the scores comparable, a system needs to normalize them depending on the modes involved

18 Miscellaneous Issues (3/3)
Multi-lingual Issues
–The TDT research program has a strong interest in evaluating the tasks across multiple languages
–From 1999 to 2001, sites were required to handle English and Chinese news stories
–In 2002, sites will be incorporating Arabic as a third language

19 Using TDT Interactively (1/2)
Demonstrations
–Lighthouse is a prototype system that visually portrays inter-document similarities to help the user find relevant material more quickly

20 Using TDT Interactively (2/2)
Timelines
–Use a timeline to show not only what the topics are, but how they occur in time
–Use a chi-square (χ²) measure to determine whether or not a feature is occurring on a given day in an unusual way
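
A sketch of the burstiness test, assuming simple day-level counts (the exact contingency table behind the timeline display is not given on the slide): a 2x2 chi-square statistic compares how often a feature occurs on the day in question against the rest of the corpus.

```python
def chi_square_burst(feature_on_day, words_on_day, feature_overall, words_overall):
    """2x2 chi-square: is this feature unusually frequent on this day?"""
    a = feature_on_day                              # feature, this day
    b = words_on_day - feature_on_day               # other words, this day
    c = feature_overall - feature_on_day            # feature, other days
    d = words_overall - words_on_day - c            # other words, other days
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```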

21 UMass at TDT 2004
Hierarchical Topic Detection
Topic Tracking
New Event Detection
Link Detection

22 Hierarchical Topic Detection Model Description (1/8)
This task replaces Topic Detection in previous TDT evaluations
Used the vector space model as the baseline
Bounded clustering is used to reduce time complexity, along with some simple parameter tuning
Because stories in the same event tend to be close in time, we only need to compare a story to its "local" stories instead of the whole collection
Two steps
–Bounded 1-NN for event formation
–Bounded agglomerative clustering for building the hierarchy

23 Hierarchical Topic Detection Model Description (2/8)
Bounded 1-NN for event formation
–All stories in the same original language and from the same source are taken out and time-ordered
–Stories are processed one by one, and each incoming story is compared to a certain number of stories (100 for the baseline) before it
–If the similarity of the current story and the most similar previous story is larger than a given threshold (0.3 for the baseline), the current story is assigned to the event that the most similar previous story belongs to; otherwise, a new event is created
–There is a list of events for each source/language class
–The events within each class are sorted by time according to the time stamp of the first story
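
A minimal sketch of the bounded 1-NN step under the baseline settings quoted above (a window of 100 prior stories and a 0.3 threshold); the similarity argument is a stand-in for the centroid/cosine comparison the system actually uses:

```python
def bounded_1nn(stories, similarity, window=100, threshold=0.3):
    """Assign each time-ordered story to the event of its most similar recent
    story, or start a new event if nothing in the window is similar enough."""
    event_of = []        # event_of[i] is the event id assigned to stories[i]
    next_event = 0
    for i, story in enumerate(stories):
        best_sim, best_event = -1.0, None
        for j in range(max(0, i - window), i):
            sim = similarity(story, stories[j])
            if sim > best_sim:
                best_sim, best_event = sim, event_of[j]
        if best_event is not None and best_sim > threshold:
            event_of.append(best_event)
        else:
            event_of.append(next_event)   # seed a new event
            next_event += 1
    return event_of
```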

24 Hierarchical Topic Detection Model Description (3/8)
Bounded 1-NN for event formation (diagram: stories S1, S2, S3 grouped into events within Language A and Language B)

25 Hierarchical Topic Detection Model Description (4/8)
Each source is segmented into several parts, and sorted by time according to the time stamp of the first story (diagram: sorted event list)

26 Hierarchical Topic Detection Model Description (5/8)
Bounded agglomerative clustering for building the hierarchy
–Take a certain number of events (the number is called WSIZE, default 120) from the sorted event list
–At each iteration, find the closest event pair and combine the later event into the earlier one

27 Hierarchical Topic Detection Model Description (6/8)
Each iteration finds the closest event pair and combines the later event into the earlier one (diagram: events I1, I2, I3, ..., Ir-1, Ir)

28 Hierarchical Topic Detection Model Description (7/8)
Bounded agglomerative clustering for building the hierarchy
–Clustering continues for (BRANCH-1)*WSIZE/BRANCH iterations, so the number of clusters left is WSIZE/BRANCH
–Take the first half out, bring in WSIZE/2 new events, and agglomeratively cluster again until WSIZE/BRANCH clusters are left
–The optimal value of BRANCH is around 3; BRANCH = 3 is used as the baseline
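
A sketch of one window pass of the bounded agglomerative step (illustrative only; similarity and merge stand in for the centroid comparison and centroid averaging): the closest pair is repeatedly merged, folding the later event into the earlier one, until only WSIZE/BRANCH clusters remain.

```python
def agglomerate_window(clusters, similarity, merge, branch=3):
    """One bounded agglomerative pass: merge closest pairs until the number
    of clusters has shrunk by a factor of `branch`."""
    clusters = list(clusters)
    target = max(1, len(clusters) // branch)
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = merge(clusters[i], clusters[j])  # fold later event into earlier one
        del clusters[j]
    return clusters
```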

29 Hierarchical Topic Detection Model Description (8/8)
Then all clusters in the same language but from different sources are combined
Finally, clusters from all languages are mixed and clustered until only one cluster is left, which becomes the root
Machine translation was used for Arabic and Mandarin stories to simplify the similarity calculation

30 Hierarchical Topic Detection Training (1/4)
Training corpus: TDT-4 (newswire and broadcast stories)
Testing corpus: TDT-5 (newswire only)
The newswire stories taken from the TDT-4 corpus include NYT, APW, ANN, ALH, AFP, ZBN, XIN
420,000 stories
(table: TDT-4 Corpus Overview)

31 Hierarchical Topic Detection Training (2/4)
TDT-5 Corpus Content

Language  Source                              Doc Count
Arabic    AFA (Agence France Presse)          30,593
Arabic    ANN (An-Nahar)                      8,162
Arabic    UMM (Ummah)                         1,104
Arabic    XIA (Xinhua)                        33,051
Arabic    Total                               72,910
English   AFE (Agence France Presse)          95,432
English   APE (Associated Press)              104,941
English   CNE (CNN)                           1,117
English   LAT (LA Times/Washington Post)      6,692
English   NYT (New York Times)                12,024
English   UME (Ummah)                         1,101
English   XIE (Xinhua)                        56,802
English   Total                               278,109
Mandarin  AFC (Agence France Presse)          5,655
Mandarin  CNA (China News Agency)             4,569
Mandarin  XIN (Xinhua)                        37,251
Mandarin  ZBN (Zaobao News)                   9,011
Mandarin  Total                               56,486
Corpus    Total                               407,505

32 Hierarchical Topic Detection Training (3/4)
Parameters
–BRANCH: average branching factor in the bounded agglomerative clustering algorithm
–Threshold: used in event formation to decide whether a new event will be created
–STOP: in each source, clustering stops when the number of clusters is smaller than the square root of the number of stories
–WSIZE: the maximum window size in agglomerative clustering
–NSTORY: each story is compared to at most NSTORY stories before it in the 1-NN event clustering; the idea comes from the time locality in event threading

33 Hierarchical Topic Detection Training (4/4)
Among the clusters very close to the root node, some contain thousands of stories
Both the 1-NN and agglomerative clustering algorithms favor large clusters
The similarity calculation was modified to give smaller clusters more of a chance
–Sim(v1, v2) is the similarity of the cluster centroids
–|cluster1| is the number of stories in the first cluster
–a is a constant that controls how much smaller clusters are favored
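
The modified formula itself did not survive the transcription; one illustrative way to realize the idea (an assumption, not necessarily the formula used in the UMass system) is to scale the centroid similarity down by a power of the first cluster's size:

```python
def size_penalized_sim(sim_centroids, cluster1_size, a=0.1):
    """Illustrative penalty (hypothetical form): larger clusters get their
    similarity scaled down, giving smaller clusters more of a chance."""
    return sim_centroids / (cluster1_size ** a)
```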

34 Hierarchical Topic Detection Result (1/2)
Three runs for each condition: UMASSv1, UMASSv12, and UMASSv19

Parameter   Description                                               UMASSv1 / UMASSv12 / UMASSv19
KNN bound   Number of previous stories compared                       100
SIM         Similarity function                                       Cluster centroid / Cluster centroid normalized, a=0 / The same
WEIGHT      Vector weight scheme                                      The same
THRESH      Threshold for KNN                                         0.3
WSIZE       Maximum number of clusters in agglomerative clustering
BRANCH      Average branching factor                                  3 / 3 / 3
STOP        Decides when clusters from different sources are mixed    5 / 5 / 5

35 Hierarchical Topic Detection Result (2/2)
A small branching factor can reduce both detection cost and travel cost
With a small branching factor, there are more clusters with different granularities
The assumption of temporal locality is useful in event threading; further experiments after the submission show that a larger window size can improve performance

36 Conclusion
Discussed several of the techniques that systems have used to build or enhance topic models, and listed the merits of many of them
TDT research explores the extent to which IR technology can be used to solve TDT problems