1 Topics Detection and Tracking
Presented by CHU Huei-Ming, 2004/03/17

2 Reference
– Pattern Recognition in Speech and Language Processing, Chap. 12: "Modeling Topics for Detection and Tracking"
  – James Allan
  – University of Massachusetts Amherst
  – Publisher: CRC Press, published 2003/02
– UMass at TDT 2004
  – Margaret Connell, Ao Feng, Giridhar Kumaran, Hema Raghavan, Chirag Shah, James Allan
  – University of Massachusetts Amherst
  – TDT 2004 workshop

3 Topic Detection and Tracking (1/6)
The goal of TDT research is to organize news stories by the events that they describe.
The TDT research program began in 1996 as a collaboration between Carnegie Mellon University, Dragon Systems, the University of Massachusetts, and DARPA.
To find out how well classic IR technologies addressed TDT, they created a small collection of news stories and identified some topics within them.

4 Topic Detection and Tracking (2/6)
Event
– something that happens at some specific time and place, along with all necessary preconditions and unavoidable consequences
Topic
– captures the larger set of happenings that are related to some triggering event
– by forcing the additional events to be directly related, the topic is prevented from spreading out to include too much news

5 Topic Detection and Tracking (3/6)
TDT Tasks
– Segmentation
  Break an audio track into discrete stories, each on a single topic
– Cluster Detection (Detection)
  Place all arriving news stories into groups based on their topics
  If the story fits no existing group, the system must decide whether to create a new topic
  Each story is placed in precisely one cluster
– Tracking
  Starts with a small set of news stories that a user has identified as being on the same topic
  The system must monitor the stream of arriving news to find all additional stories on the same topic

6 Topic Detection and Tracking (4/6)
– New Event Detection (first story detection)
  Focuses on the cluster-creation aspect of cluster detection
  Evaluated on its ability to decide when a new topic (event) appears
– Link Detection
  Determine whether or not two randomly presented stories discuss the same topic
  A solution to this task could be used to solve new event detection

7 Topic Detection and Tracking (5/6)
Corpora
– TDT-2: in 2002 it is being augmented with some Arabic news from the same time period
– TDT-3: created for the 1999 evaluation; stories from four Arabic sources are being added during 2002

Stage        Source                                                              Stories  Topics  Duration
Pilot study  CNN, Reuters                                                        16,000   25      1994/7~12, 1995/1~6
TDT-2        six English sources and three Chinese sources                       80,000   100     1998/1~6
TDT-3        eight English sources and three Chinese sources                     40,000   120     1998/10~12
TDT-4        eight English sources, three Chinese sources, four Arabic sources   45,000   60      2000/10~12, 2001/1

8 Topic Detection and Tracking (6/6)
Evaluation
– P(target) is the prior probability that a story will be on topic
– C_x are the user-specified values that reflect the cost associated with each kind of error
– P(miss) and P(fa) are the actual system error rates
– Within TDT evaluations, C_miss = 10, C_fa = 1
– P(target) = 1 - P(off-target) = 0.02 (derived from training data)
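The cost formula itself did not survive the transcript. The detection cost used in the TDT evaluations combines these quantities as follows (stated here from the standard TDT evaluation plan, not recovered from the slide):

    C_det = C_miss * P(miss) * P(target) + C_fa * P(fa) * (1 - P(target))

It is usually reported in normalized form, divided by min(C_miss * P(target), C_fa * (1 - P(target))).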

9 Basic Topic Model
Vector Space
– Represent items (stories or topics) as vectors in a high-dimensional space
– The most common comparison function is the cosine of the angle between the two vectors
Language Models
– A topic is represented as a probability distribution over words
– The initial probability estimates come from the maximum likelihood estimate based on the document
Use of the topic model
– See how likely it is that a particular story could be generated by the model
– Compare two models directly: symmetric version of the Kullback-Leibler divergence
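A minimal sketch of the two representations, assuming stories arrive already tokenized; the function names and the sparse-dictionary representation are illustrative choices, not from the slides:

    import math
    from collections import Counter

    def unigram_model(tokens):
        """Maximum-likelihood unigram model: P(w) = count(w) / document length."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    def cosine(vec_a, vec_b):
        """Cosine of the angle between two sparse term-weight vectors (dicts)."""
        dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
        norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
        norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0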

10 Implementing the Models (1/3)
Named Entities
– News is usually about people, so it seems reasonable that their names could be treated specially
– Treat the named entities as a separate part of the model and then merge the parts
– Boost the weight of any words in the stories that come from names, giving them a larger contribution to the similarity when the names are in common
– Improves the results slightly; no strong effect so far
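A sketch of the weight-boosting variant described above; the boost factor and the assumption that named entities have already been tagged are mine, not from the slide:

    def boost_named_entities(term_weights, named_entities, boost=2.0):
        """Multiply the weight of terms tagged as named entities so that shared
        names contribute more to the similarity between two stories."""
        return {term: weight * boost if term in named_entities else weight
                for term, weight in term_weights.items()}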

11 Implementing the Models (2/3)
Document Expansion
– In the segmentation task, a possible segmentation boundary could be checked by comparing the models generated by the text on either side
– The text could be used as a query to retrieve a few dozen related stories, and then the most frequently occurring words from those stories could be used for the comparison
– Relevance models result in substantial improvements in the link detection task
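A sketch of the expansion step described above. The retrieve function is a placeholder for whatever IR engine is available; the counts of 30 stories and 50 terms are illustrative, not from the slide:

    from collections import Counter

    def expand_document(story_tokens, retrieve, n_docs=30, n_terms=50):
        """Expand a story with the most frequent terms from related stories.
        `retrieve(query_tokens, k)` is a placeholder search function that returns
        the token lists of the k stories most similar to the query."""
        related = retrieve(story_tokens, n_docs)
        counts = Counter()
        for doc_tokens in related:
            counts.update(doc_tokens)
        expansion_terms = [term for term, _ in counts.most_common(n_terms)]
        return story_tokens + expansion_terms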

12 Implementing the Models (3/3)
Time Decay
– The likelihood that two stories discuss the same topic diminishes as the stories are further separated in time
– In a vector space model, the cosine similarity function can be changed so that it includes a time decay
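One common way to fold time into the comparison is to multiply the cosine by an exponential decay; the exact functional form is an assumption, since the slide does not give it:

    sim_time(s1, s2) = cos(s1, s2) * exp( -|t1 - t2| / T )

where t1 and t2 are the stories' time stamps and T controls how quickly old stories stop matching.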

13 Comparing Models (1/3)
Nearest Neighbors
– In the vector space model, a topic might be represented as a single vector
– To determine whether or not a story is on any of the existing topics, we consider the distance between the story's vector and the closest topic vector
– If it falls outside the specified distance, the story is likely to be the seed of a new topic and a new vector can be formed

14 Comparing Models (2/3)
Decision Trees
– The best place for decision trees within TDT may be the segmentation task
– There are numerous training instances (hand-segmented stories)
– Finding features that are indicative of a story boundary is possible and achieves good quality

15 Comparing Models (3/3)
Model-to-Model
– Direct comparison of the statistical language models that represent topics
– Kullback-Leibler divergence
– To make the measure symmetric, calculate it both ways and add the results together
– One approach that has been used to incorporate that notion penalizes the comparison if the models are too much like background news
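Written out, the "both ways" divergence described above is

    D_sym(p, q) = KL(p || q) + KL(q || p)
                = sum_w p(w) log( p(w)/q(w) ) + sum_w q(w) log( q(w)/p(w) )

where p and q are the unigram topic models being compared.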

16 Miscellaneous Issues (1/3)
Deferral
– All of the tasks are envisioned as "on-line" tasks
– The decision about a story is expected before the next story is presented
– In fact, TDT provides a moderate amount of look-ahead for the tasks
– First, stories are always presented to the system grouped into "files" that correspond to about a half hour of news
– Second, the formal TDT evaluation incorporates a notion of deferral that allows a system to explore the advantage of deferring decisions until several files have passed

17 Miscellaneous Issues (2/3)
Multi-modal Issues
– The stories TDT systems must deal with are either written text (newswire) or read text (audio)
– Speech recognizers make numerous mistakes, inserting, deleting, and even completely transforming words into other words
– The main difference between the two modes is score normalization
– For a pair of stories drawn from different sources, the score distribution is different; for the scores to be comparable, a system needs to normalize depending on those modes
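A minimal sketch of per-condition score normalization: z-score the similarity scores within each source-mode pairing so that a single threshold is meaningful across conditions. The z-score choice is my assumption; the slide only says that normalization must depend on the modes.

    import statistics

    def normalize_by_condition(scored_pairs):
        """scored_pairs: list of (condition, score) tuples, where condition
        identifies the source modes of the two stories, e.g. ('newswire', 'broadcast').
        Returns the pairs with scores z-normalized within each condition."""
        by_condition = {}
        for condition, score in scored_pairs:
            by_condition.setdefault(condition, []).append(score)
        stats = {c: (statistics.mean(v), statistics.pstdev(v) or 1.0)
                 for c, v in by_condition.items()}
        return [(c, (s - stats[c][0]) / stats[c][1]) for c, s in scored_pairs]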

18 Miscellaneous Issues (3/3)
Multi-lingual Issues
– The TDT research program has a strong interest in evaluating the tasks across multiple languages
– From 1999 to 2001, sites were required to handle English and Chinese news stories
– In 2002, sites will be incorporating Arabic as a third language

19 Using TDT Interactively (1/2)
Demonstrations
– Lighthouse is a prototype system that visually portrays inter-document similarities to help the user find relevant material more quickly

20 Using TDT Interactively (2/2)
Timelines
– Use a timeline to show not only what the topics are, but how they occur in time
– Use a chi-square measure to determine whether or not a feature is occurring on a given day in an unusual way
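One concrete reading of the chi-square test above (the 2x2 setup is an assumption; the slide only names the measure): build a contingency table of feature present/absent on this day versus all other days and compute

    chi^2 = sum_ij (O_ij - E_ij)^2 / E_ij

where O_ij are the observed counts and E_ij the counts expected if the feature were spread evenly over days; a large value flags the feature as unusually frequent on that day.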

21 UMass at TDT 2004
– Hierarchical Topic Detection
– Topic Tracking
– New Event Detection
– Link Detection

22 Hierarchical Topic Detection Model Description (1/8)
This task replaces Topic Detection in previous TDT evaluations
Used a vector space model as the baseline
Bounded clustering to reduce time complexity, with some simple parameter tuning
Since stories in the same event tend to be close in time, we only need to compare a story to its "local" stories instead of the whole collection
Two steps
– Bounded 1-NN for event formation
– Bounded agglomerative clustering for building the hierarchy

23 Hierarchical Topic Detection Model Description (2/8)
Bounded 1-NN for event formation
– All stories in the same original language and from the same source are taken out and time-ordered
– Stories are processed one by one, and each incoming story is compared to a certain number of stories (100 for the baseline) before it
– If the similarity between the current story and the most similar previous story is larger than a given threshold (0.3 for the baseline), the current story is assigned to the event that the most similar previous story belongs to; otherwise, a new event is created
– There is a list of events for each source/language class
– The events within each class are sorted by time according to the time stamp of the first story
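A sketch of the bounded 1-NN step described above, using the baseline window of 100 previous stories and threshold of 0.3. The similarity argument stands in for the cosine function from the earlier sketch; stories are assumed to be sparse term vectors from one source/language class, already time-ordered.

    def bounded_1nn(stories, similarity, window=100, threshold=0.3):
        """Assign each story (time-ordered, one source/language class) to an event.
        similarity: function taking two stories and returning a score.
        Returns a list with one event id per story."""
        event_of = []        # event id assigned to each story so far
        next_event = 0
        for i, story in enumerate(stories):
            best_sim, best_j = 0.0, None
            # compare only against the `window` stories immediately before this one
            for j in range(max(0, i - window), i):
                sim = similarity(story, stories[j])
                if sim > best_sim:
                    best_sim, best_j = sim, j
            if best_j is not None and best_sim > threshold:
                event_of.append(event_of[best_j])   # join the nearest story's event
            else:
                event_of.append(next_event)         # seed a new event
                next_event += 1
        return event_of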

24 Hierarchical Topic Detection Model Description (3/8)
Bounded 1-NN for event formation
(figure: stories S1, S2, S3 grouped into events, separately for Language A and Language B)

25 Hierarchical Topic Detection Model Description (4/8)
Each source is segmented into several parts, and these are sorted by time according to the time stamp of the first story
(figure: sorted event list)

26 Hierarchical Topic Detection Model Description (5/8)
Bounded agglomerative clustering for building the hierarchy
– Take a certain number of events (the number is called WSIZE, default 120) from the sorted event list
– At each iteration, find the closest event pair and combine the later event into the earlier one

27 Hierarchical Topic Detection Model Description (6/8)
Each iteration finds the closest event pair and combines the later event into the earlier one
(figure: events I1, I2, I3, ..., I(r-1), Ir being merged)

28 Hierarchical Topic Detection Model Description (7/8)
Bounded agglomerative clustering for building the hierarchy
– Continue for (BRANCH-1)*WSIZE/BRANCH iterations, so the number of clusters left is WSIZE/BRANCH
– Take the first half out, bring in WSIZE/2 new events, and agglomeratively cluster until WSIZE/BRANCH clusters are left
– The optimal value is around 3; BRANCH=3 is used as the baseline
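A simplified sketch of the bounded agglomerative step: merge the closest pair (later into earlier) until WSIZE/BRANCH clusters remain, which is exactly (BRANCH-1)*WSIZE/BRANCH merges. Centroids are reduced to summed term vectors here, and the sliding of the window over the sorted event list is omitted; the real system's bookkeeping may differ.

    def agglomerate_window(event_vectors, similarity, branch=3):
        """Agglomeratively cluster a window of event centroid vectors (dicts of
        term weights, in time order) until len(window)/branch clusters remain."""
        clusters = [dict(v) for v in event_vectors]
        target = max(1, len(clusters) // branch)
        while len(clusters) > target:
            # find the closest pair of clusters
            best_sim, best_i, best_j = -1.0, 0, 1
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = similarity(clusters[i], clusters[j])
                    if sim > best_sim:
                        best_sim, best_i, best_j = sim, i, j
            # fold the later cluster into the earlier one (summed weights as centroid)
            for term, weight in clusters[best_j].items():
                clusters[best_i][term] = clusters[best_i].get(term, 0.0) + weight
            del clusters[best_j]
        return clusters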

29 Hierarchical Topic Detection Model Description (8/8)
Then all clusters in the same language but from different sources are combined
Finally, clusters from all languages are mixed and clustered until only one cluster is left, which becomes the root
Machine translation is used for Arabic and Mandarin stories to simplify the similarity calculation

30 Hierarchical Topic Detection Training (1/4)
Training corpus: TDT-4 – newswire and broadcast stories
Testing corpus: TDT-5 – newswire only
Newswire stories taken from the TDT-4 corpus include NYT, APW, ANN, ALH, AFP, ZBN, XIN
420,000 stories
(table: TDT-4 Corpus Overview)

31 Hierarchical Topic Detection Training (2/4)
TDT-5 Corpus Content

Language  Source                           Doc Count
Arabic    AFA (Agence France Presse)          30,593
Arabic    ANN (An-Nahar)                       8,162
Arabic    UMM (Ummah)                          1,104
Arabic    XIA (Xinhua)                        33,051
Arabic    Total                               72,910
English   AFE (Agence France Presse)          95,432
English   APE (Associated Press)             104,941
English   CNE (CNN)                            1,117
English   LAT (LA Times/Washington Post)       6,692
English   NYT (New York Times)                12,024
English   UME (Ummah)                          1,101
English   XIE (Xinhua)                        56,802
English   Total                              278,109
Mandarin  AFC (Agence France Presse)           5,655
Mandarin  CNA (China News Agency)              4,569
Mandarin  XIN (Xinhua)                        37,251
Mandarin  ZBN (Zaobao News)                    9,011
Mandarin  Total                               56,486
Corpus    Total                              407,505

32 Hierarchical Topic Detection Training (3/4)
Parameters
– BRANCH: average branching factor in the bounded agglomerative clustering algorithm
– THRESHOLD: used in event formation to decide whether a new event will be created
– STOP: in each source, cluster until the number of clusters is smaller than the square root of the number of stories
– WSIZE: the maximum window size in agglomerative clustering
– NSTORY: each story is compared to at most NSTORY stories before it in the 1-NN event clustering; the idea comes from the time locality in event threading

33 Hierarchical Topic Detection Training (4/4)
Among the clusters very close to the root node, some contain thousands of stories
Both the 1-NN and agglomerative clustering algorithms favor large clusters
Modified the similarity calculation to give smaller clusters more of a chance
– Sim(v1,v2) is the similarity of the cluster centroids
– |cluster1| is the number of stories in the first cluster
– a is a constant that controls how much of an advantage smaller clusters get
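The normalized-similarity formula did not survive the transcript. One plausible form that matches the description (an assumption, not necessarily the authors' exact equation) is

    sim'(cluster1, cluster2) = Sim(v1, v2) / ( |cluster1| * |cluster2| )^a

so that a = 0 recovers the plain centroid similarity (as in the "a=0" setting of the results table) and larger a increasingly favors merging small clusters.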

34 Hierarchical Topic Detection Result (1/2)
Three runs for each condition: UMASSv1, UMASSv12 and UMASSv19

Parameter  Description                                              UMASSv1           UMASSv12                          UMASSv19
KNN bound  Number of previous stories compared                      100               100                               100
SIM        Similarity function                                      Cluster centroid  Cluster centroid normalized, a=0  The same
WEIGHT     Vector weighting scheme                                  The same          The same                          The same
THRESH     Threshold for KNN                                        0.3               0.3                               0.3
WSIZE      Maximum number of clusters in agglomerative clustering   120               120                               240
BRANCH     Average branching factor                                 3                 3                                 3
STOP       Decides when clusters from different sources are mixed   5                 5                                 5

35 Hierarchical Topic Detection Result (2/2)
A small branching factor can reduce both detection cost and travel cost
With a small branching factor, there are more clusters with different granularities
The assumption of temporal locality is useful in event threading; further experiments after the submission show that a larger window size can improve performance

36 Conclusion
Discussed several of the techniques that systems have used to build or enhance topic models, and listed the merits of many of them
TDT research explores the extent to which IR technology can be used to solve TDT problems

