Topic Detection and Tracking : Event Clustering as a Basis for First Story Detection AI-Lab Jung Sung Won.

Abstract  Topic Detection and Tracking(TDT)  The organization of information by event than by subject  In this paper  Overview of the TDT research program  Discuss our approach to two of the TDT problems  Event clustering (Detection)  First story detection

Introduction  Information Retrieval Research  Texts are usually indexed, retrieved and organized on the basis of their subjects.  This research  Focusing on the events that are described by the text rather than the broader subject it covers.  What is the major event discussed within this story?  Do these texts discuss the same event?  Not all texts can be reduced to a set of events  This work will necessarily apply only to text that have an event focus : announcements, news

Topic Detection and Tracking  연구의 목적  방송 기사들을 사건으로 구성하기 위하여  기사의 수집처 : 텔레비전, 라디오, 유선방송  Automatic speech recognition (ASR) 필요  TDT 연구는 3 차례에 걸처 이루어짐  TDT-1 : 1996 년 중반 ~ 1997 년  TDT-2 : 1998 년  TDT-3 : 1999 년

TDT-1, The Pilot Study (1/2)  A proof-of-concept effort  First project : definition of the problem  Event : some unique thing that happens at some point in time (Allan et al., 1998a)  Ex) “computer virus detected at British Telecom, March 3, 1993” ↔ “computer virus outbreaks”  Definition of three research problems  Segmentation  Detection  Event clustering  First story detection  Tracking

TDT-1, The Pilot Study (2/2)  Created a small evaluation corpus  15,683 news stories (CNN, Reuters)  Generated a set of 25 news topics  Employed a two-pronged method for assigning relevance judgments between topics and stories  Group 1 : read every story in the corpus  Group 2 : used a search engine to look for stories on each of the 25 topics

TDT-2, A Full Evaluation  The primary goal  Create a full-scale evaluation of the TDT tasks begun in the pilot study.  Changes  The two detection tasks were “merged” to create an on-line version of the event clustering task.  Unlike TDT-1, stories are processed in small groups  A revised evaluation method is used (Detection Error Tradeoff graphs)  A larger corpus is used

TDT-3, Multi-Lingual TDT  Differences from TDT-2  The scope of the events considered differs according to the task.  Introduction of multi-lingual sources  New evaluation corpus.

On-Line Clustering Algorithms  Previous clustering work  Retrospective environment  In this study  On-line solution to first story detection.  Steps of Clustering  Converting stories to a vector  Comparing stories and clusters  Applying a threshold to determine if clusters are sufficiently similar  Classifier : the combination of vector and threshold
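
The clustering steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: it assumes raw term counts as story vectors, cosine similarity as the comparison, a best-matching-member (single-link) rule, and a fixed threshold of 0.5. The names `to_vector`, `cosine` and `single_pass_cluster` are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Step 1: convert a story to a vector (raw term counts, an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Step 2: compare two vectors with cosine similarity."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(stories, threshold=0.5):
    """Assign each story, in arrival order, to a cluster id in one pass."""
    clusters = []  # each cluster is a list of story vectors
    labels = []
    for text in stories:
        vec = to_vector(text)
        best_idx, best_sim = None, 0.0
        for idx, members in enumerate(clusters):
            sim = max(cosine(vec, m) for m in members)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        # Step 3: apply the threshold to decide between joining and creating.
        if best_idx is not None and best_sim >= threshold:
            clusters[best_idx].append(vec)
            labels.append(best_idx)
        else:
            clusters.append([vec])
            labels.append(len(clusters) - 1)
    return labels
```

Because each story is seen once and never reassigned, the number of clusters does not need to be known in advance, which is the property an on-line setting requires.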

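Related Clustering Work  Existing clustering techniques  Agglomerative hierarchical clustering  Probabilistic approaches  These techniques assume the items to be clustered are available in advance  Limitations in an on-line environment  The items to be clustered are not available in advance  Some algorithms require the number of clusters to be fixed, but in an on-line setting the number of clusters is unknown  An algorithm capable of single-pass clustering is needed → several such algorithms already exist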

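Creating Story Vectors  Weight vectors are built using INQUERY  t : number of times a particular lexical feature occurs in the story  dl : the story’s length in words  avg_dl : average number of terms per story  C : number of stories in the auxiliary corpus  df : number of stories in which the term appears (if df = 0, df is set to 1)  k : index of the words that appear in both the classifier and the story  d_{j,k} : similarity weight of feature k for the story arriving at time j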
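
The weighting formula itself is not preserved on this slide. As a hedged sketch, the function below assumes the commonly published INQUERY-style belief function (a length-normalized tf component times a scaled idf component, mapped into the range 0.4 to 1.0). The constants 0.4, 0.6, 0.5 and 1.5 and the function name `inquery_weight` are assumptions, not values taken from the slide.

```python
import math


def inquery_weight(t, dl, avg_dl, C, df):
    """Weight of one lexical feature in one story (assumed INQUERY-style form).

    t      -- occurrences of the feature in the story
    dl     -- story length in words
    avg_dl -- average story length
    C      -- number of stories in the auxiliary corpus
    df     -- number of stories containing the feature
    """
    df = max(df, 1)  # as on the slide: df = 0 is treated as df = 1
    tf = t / (t + 0.5 + 1.5 * dl / avg_dl)
    idf = math.log((C + 0.5) / df) / math.log(C + 1)
    return 0.4 + 0.6 * tf * idf
```

Under this form a feature that never occurs in the story gets the default weight 0.4, and rarer features (smaller df) receive higher weights than common ones at the same frequency.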

Comparing Clusters  Comparing a story to a cluster or the contents of two clusters  Single-link, complete-link, group-average
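
The three comparison strategies named on this slide differ only in how they aggregate pairwise similarities. The sketch below illustrates that difference; the helper names and the caller-supplied `sim` function are hypothetical, and a single story can be compared to a cluster by passing it as a one-element list.

```python
from itertools import product


def pairwise(cluster_a, cluster_b, sim):
    """All similarities between members of two clusters."""
    return [sim(x, y) for x, y in product(cluster_a, cluster_b)]


def single_link(cluster_a, cluster_b, sim):
    """Similarity of the closest pair of members."""
    return max(pairwise(cluster_a, cluster_b, sim))


def complete_link(cluster_a, cluster_b, sim):
    """Similarity of the farthest pair of members."""
    return min(pairwise(cluster_a, cluster_b, sim))


def group_average(cluster_a, cluster_b, sim):
    """Mean similarity over all pairs of members."""
    sims = pairwise(cluster_a, cluster_b, sim)
    return sum(sims) / len(sims)
```

Single-link is the most permissive of the three (one close member is enough to match), complete-link the strictest, and group-average sits between them.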

Thresholds for Merging  Why a threshold is used  The decision for generating a new cluster.  Clustering for first story detection  Ex) with a threshold of 0.5  Time-based thresholds  Take into account the temporal characteristics of real news  Threshold for a story arriving at time j, relative to the classifier computed at time i  Decision Scores
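
The threshold formula is not preserved on this slide, so the following is only an illustration of the idea of a time-based threshold. It assumes the threshold grows linearly with the gap between the time i at which the classifier was formed and the time j at which the story arrives, and that the decision score is the similarity minus that threshold. The parameters `base` and `time_penalty` and both function names are hypothetical.

```python
def time_threshold(i, j, base=0.5, time_penalty=0.001):
    """Threshold applied to a story at time j for a classifier formed at time i.

    Assumed linear form: the further apart the two are in time,
    the more similar the story must be to count as a match.
    """
    if j < i:
        raise ValueError("the story must arrive after the classifier is formed")
    return base + time_penalty * (j - i)


def decision_score(similarity, i, j, base=0.5, time_penalty=0.001):
    """Positive score: the story matches the classifier; otherwise it does not."""
    return similarity - time_threshold(i, j, base, time_penalty)
```

With these illustrative defaults, a similarity of 0.6 is a match 10 time steps after the classifier was formed but not 200 steps after, which reflects the tendency of stories about one event to appear close together in time.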

Experimental Setting – Data

Evaluation Measures  Measures of text classification effectiveness  Recall and precision  Misses : the system does not detect a new event  False alarms : the system indicates a story contains a new event when in truth it does not  F1-Measure (Lewis and Gale, 1994) : 2PR/(P+R)  TDT cost function  P(fa) : the system false alarm rate  P(m) : miss probability  P(topic) : the prior probability that a story is relevant to a topic  cost_fa = cost_m = 1.0
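
The measures on this slide can be computed as in the sketch below. F1 follows the 2PR/(P+R) formula given above. The cost function body is not shown on the slide, so `tdt_cost` assumes the standard TDT detection cost form, in which misses are weighted by the topic prior and false alarms by its complement; the function names are hypothetical.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall and F1 = 2PR / (P + R) from raw decision counts."""
    p = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    r = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def tdt_cost(p_miss, p_fa, p_topic, cost_miss=1.0, cost_fa=1.0):
    """Assumed TDT cost: cost_m * P(m) * P(topic) + cost_fa * P(fa) * (1 - P(topic))."""
    return cost_miss * p_miss * p_topic + cost_fa * p_fa * (1.0 - p_topic)
```

Because P(topic) is usually small, even a low false alarm rate can contribute as much to the cost as a much higher miss rate.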

Event Clustering

First Story Detection  New topic : a topic whose event has not been previously reported  Motivation  The property of time as a distinguishing feature of this domain  The name of the people, places, dates, and things : who, what, when, where  Method  Use event clustering method  If no classifier comparison results in a positive classification decision for the current story, then the current story has content not previously encountered, and thus it contains discussion of a new topic  Difference : finding the start of each topic  On-line single link + time strategy

First Story Detection Experiment

Discussion of First Story Detection  The Good News  Low false alarm rates  The Bad News  Empirically, only incremental improvement can be expected.  The limitation of the word co-occurrence model  Association with topics that are heavily covered in the news
