Published by Regina Sharp. Modified over 9 years ago.
1
Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection. AI-Lab, Jung Sung Won
2
Abstract. Topic Detection and Tracking (TDT): the organization of information by event rather than by subject. In this paper: an overview of the TDT research program; a discussion of our approach to two of the TDT problems, event clustering (detection) and first story detection.
3
Introduction. Information retrieval research: texts are usually indexed, retrieved, and organized on the basis of their subjects. This research: focuses on the events described by a text rather than the broader subject it covers. What is the major event discussed within this story? Do these texts discuss the same event? Not all texts can be reduced to a set of events, so this work necessarily applies only to texts that have an event focus, such as announcements and news.
4
Topic Detection and Tracking. Research goal: to organize broadcast news stories by event. Story sources: television, radio, cable broadcast. Automatic speech recognition (ASR) is required. TDT research was carried out in three phases. TDT-1: mid-1996 to 1997. TDT-2: 1998. TDT-3: 1999.
5
TDT-1, The Pilot Study (1/2). A proof-of-concept effort. First project: definition of the problem. An event is “some unique thing that happens at some point in time” (Allan et al., 1998a). Example: “computer virus detected at British Telecom, March 3, 1993” is an event, in contrast to the broader subject “computer virus outbreaks”. Definition of three research problems: segmentation; detection (event clustering and first story detection); tracking.
6
TDT-1, The Pilot Study (2/2). Created a small evaluation corpus of 15,683 news stories (CNN, Reuters). Generated a set of 25 news topics. Employed a two-pronged method for assigning relevance judgments between topics and stories. Group 1: read every story in the corpus. Group 2: used a search engine to look for stories on each of the 25 topics.
7
TDT-2, A Full Evaluation. The primary goal: create a full-scale evaluation of the TDT tasks begun in the pilot study. Change: the two detection tasks were “merged” to create an on-line version of the event clustering task. Differences from TDT-1: processing in small groups; a revised evaluation method (Detection Error Tradeoff graphs); a larger corpus.
8
TDT-3, Multi-Lingual TDT. Differences from TDT-2: the scope of the events is defined differently, taking the tasks into account; multi-lingual sources were introduced; a new evaluation corpus.
9
On-Line Clustering Algorithms. Previous clustering work: retrospective environment. In this study: an on-line solution to first story detection. Steps of clustering: (1) converting stories to vectors; (2) comparing stories and clusters; (3) applying a threshold to determine whether clusters are sufficiently similar. Classifier: the combination of a vector and a threshold.
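The three steps on this slide can be illustrated with a minimal single-pass sketch. It is not the authors' system: raw term counts stand in for the real feature weights, each cluster is represented by the sum of its members' vectors, and the names `to_vector`, `cosine`, `single_pass_cluster` and the threshold value 0.3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def to_vector(text):
    """Step 1: convert a story to a vector (raw term counts, an assumed stand-in)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(stories, threshold=0.3):
    """Assign each story, in arrival order, to the best cluster or start a new one."""
    clusters = []  # one summed term-count vector per cluster
    labels = []    # cluster id assigned to each story
    for text in stories:
        vec = to_vector(text)
        # Step 2: compare the story with every existing cluster.
        best_id, best_sim = None, 0.0
        for cid, centroid in enumerate(clusters):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        # Step 3: apply the threshold to decide between merging and a new cluster.
        if best_id is not None and best_sim >= threshold:
            clusters[best_id].update(vec)  # Counter.update adds the counts
            labels.append(best_id)
        else:
            clusters.append(Counter(vec))
            labels.append(len(clusters) - 1)
    return labels
```

Because every story is seen exactly once and the number of clusters is never fixed in advance, the sketch fits the on-line setting the slide describes.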
10
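Related Clustering Work. Existing clustering techniques: agglomerative hierarchical clustering; probabilistic approaches. These techniques assume that the items to be clustered are available in advance. Limitations in an on-line environment: the items to be clustered are not available in advance; some algorithms require the number of clusters to be fixed, but in an on-line setting the number of clusters is unknown. An algorithm capable of single-pass clustering is needed → several such algorithms already exist.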
11
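Creating Story Vectors. INQUERY is used to build the weight vector. t: the number of times a given lexical feature occurs in the story. dl: the story's length in words. avg_dl: the average number of terms in a story. C: the number of stories in the auxiliary corpus. df: the number of stories in which the term appears (if df = 0, df is set to 1). k: the index of the words that appear in both the classifier and the story. d_{j,k}: the similarity value for the story appearing at time j.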
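The weighting formula itself is not present in the slide text, so the sketch below is only a guess at how the listed variables might combine. It assumes the commonly cited INQUERY tf·idf belief form; the constants 0.4, 0.6, 0.5 and 1.5 and the function name `inquery_weight` are assumptions, not taken from the slide.

```python
import math


def inquery_weight(t, dl, avg_dl, C, df):
    """Assumed INQUERY-style weight of one lexical feature in one story.

    t      -- occurrences of the feature in the story
    dl     -- story length in words
    avg_dl -- average story length in the collection
    C      -- number of stories in the auxiliary corpus
    df     -- number of stories containing the feature
    """
    df = max(df, 1)  # from the slide: if df = 0, use df = 1
    # Term-frequency component, normalized by relative story length (assumed form).
    tf = t / (t + 0.5 + 1.5 * dl / avg_dl)
    # Inverse document frequency, scaled into (0, 1) (assumed form).
    idf = math.log((C + 0.5) / df) / math.log(C + 1)
    # Assumed default belief of 0.4 plus the scaled tf * idf evidence.
    return 0.4 + 0.6 * tf * idf
```

Under these assumptions a feature that does not occur in the story gets the baseline 0.4, and the weight rises with more occurrences or a rarer feature.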
12
Comparing Clusters. Comparing a story to a cluster, or the contents of two clusters. Strategies: single-link, complete-link, group-average.
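A small sketch of how the three comparison strategies differ when a story is compared with a cluster. The Jaccard word overlap used as the pairwise similarity is an assumption chosen to keep the example self-contained; the names `jaccard`, `LINKAGE` and `story_cluster_similarity` are hypothetical.

```python
from statistics import mean


def jaccard(a, b):
    """Word-set overlap between two stories (an assumed stand-in similarity)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


# Each strategy reduces the list of pairwise similarities to one score.
LINKAGE = {
    "single": max,     # similarity to the closest member
    "complete": min,   # similarity to the farthest member
    "average": mean,   # mean similarity over all members
}


def story_cluster_similarity(story, cluster, linkage="single", sim=jaccard):
    """Compare one story with a non-empty cluster (a list of stories)."""
    scores = [sim(story, member) for member in cluster]
    return LINKAGE[linkage](scores)
```

Comparing two clusters works the same way, with the scores taken over every pair of stories drawn from the two clusters.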
13
Thresholds for Merging. Why a threshold is used: it drives the decision to generate a new cluster. Clustering for first story detection. Example: the case of a threshold of 0.5. Time-based thresholds: take the temporal properties of real news into account by defining the threshold that a classifier computed at time i applies to a story arriving at time j. Decision scores.
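The threshold formula is not preserved in the slide text. The sketch below assumes one plausible form, in which the threshold starts from a floor, is raised in proportion to how strongly the classifier matches its own originating story, and grows linearly with the time gap j - i. The parameter names `p`, `tp`, `floor` and their default values are assumptions for illustration only.

```python
def time_threshold(self_sim, i, j, p=0.5, tp=0.001, floor=0.4):
    """Assumed threshold of a classifier built at time i for a story arriving at time j.

    self_sim -- similarity of the classifier to the story it was built from
    p        -- weight on the classifier's self-similarity (assumed)
    tp       -- time penalty per step of distance between i and j (assumed)
    floor    -- baseline similarity value (assumed)
    """
    return floor + p * (self_sim - floor) + tp * (j - i)


def decision_score(sim, threshold):
    """Positive score: the story matches the classifier; otherwise it does not."""
    return sim - threshold
```

With this form, a story that arrives long after a classifier was created needs a higher similarity to be merged, which mirrors the observation that stories about the same event tend to appear close together in time.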
14
Experimental Setting – Data
15
Evaluation Measures. Measures of text classification effectiveness: recall and precision. Misses: the system does not detect a new event. False alarms: the system indicates that a story contains a new event when in truth it does not. F1 measure (Lewis and Gale, 1994): 2PR/(P+R). TDT cost function: P(fa) is the system false alarm rate; P(m) is the miss probability; P(topic) is the prior probability that a story is relevant to a topic; cost_fa = cost_m = 1.0.
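A short sketch of the two measures. The F1 formula is the one given on the slide. The way the cost function combines its terms is not preserved in the slide text, so the code assumes the usual TDT form, in which misses are weighted by P(topic) and false alarms by 1 - P(topic); the function names and example inputs are hypothetical.

```python
def f1(precision, recall):
    """F1 = 2PR / (P + R); defined as 0.0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def tdt_cost(p_miss, p_fa, p_topic, cost_miss=1.0, cost_fa=1.0):
    """Assumed TDT cost: expected cost of misses plus expected cost of false alarms."""
    return cost_miss * p_miss * p_topic + cost_fa * p_fa * (1.0 - p_topic)


# Example with made-up rates: 20% misses, 10% false alarms, 2% topic prior.
example_cost = tdt_cost(p_miss=0.2, p_fa=0.1, p_topic=0.02)
```

Because P(topic) is typically small, the false alarm term dominates the cost under this form even when both unit costs are 1.0.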
16
Event Clustering
17
First Story Detection. New topic: a topic whose event has not been previously reported. Motivation: the property of time as a distinguishing feature of this domain; the names of people, places, dates, and things (who, what, when, where). Method: use the event clustering method. If no classifier comparison results in a positive classification decision for the current story, then the current story has content not previously encountered, and thus it contains discussion of a new topic. Difference: finding the start of each topic. On-line single-link + time strategy.
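The decision rule above can be sketched as follows. This is an illustration, not the authors' implementation: each earlier story acts as its own classifier (a single-link view), word-set overlap stands in for the real similarity, and the threshold grows with the time gap. The names `first_stories`, `overlap` and the values of `threshold` and `time_penalty` are assumptions.

```python
def tokens(text):
    """Lowercased word set of a story."""
    return set(text.lower().split())


def overlap(a, b):
    """Jaccard overlap between two word sets (an assumed stand-in similarity)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def first_stories(stories, threshold=0.2, time_penalty=0.01):
    """Return one flag per story: True if no earlier story yields a positive decision."""
    seen = []   # word sets of the stories processed so far
    flags = []
    for j, text in enumerate(stories):
        current = tokens(text)
        # A positive decision score against any earlier story means old content.
        matched = any(
            overlap(current, old) - (threshold + time_penalty * (j - i)) > 0
            for i, old in enumerate(seen)
        )
        flags.append(not matched)
        seen.append(current)
    return flags
```

The first story in any stream is always flagged, since there is nothing earlier to compare it with.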
18
First Story Detection Experiment
19
Discussion of First Story Detection. The good news: low false alarm rates. The bad news: empirically, only incremental improvement can be expected. The limitation of the word co-occurrence model. Association with topics that are heavily covered in the news.