1
UMass Amherst at TDT 2003
James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst
2
What we did
- Tasks
  - Story Link Detection
  - Topic Tracking
  - New Event Detection
  - Cluster Detection
3
Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models
4
ROI motivation
- Analyzed vector space similarity measures
  - Failed to distinguish between similar topics
  - e.g., two “health care” stories from different topics
    - different locations and individuals
    - similarity dominated by “health care” terms: drugs, cost, coverage, plan, prescription
- Possible solution: first categorize stories
  - different category implies different topics (mostly true)
  - use within-category statistics
    - “health care” may be less confusing
- Rules of Interpretation provide natural categories
5
ROI intuition
- Each document in the corpus is classified into one of the ROI categories
- Stories in different ROIs are less likely to be in the same topic
- If two stories belong to different ROIs, we should trust their similarities less
- With an ROI-tagged corpus:
  - same ROI: sim_new(s1, s2) = sim_old(s1, s2)
  - different ROIs: sim_new(s1, s2) < sim_old(s1, s2)
6
ROI classifiers
- Naïve Bayes
- BoosTexter [Schapire and Singer, 2000]
  - Decision tree classifier
  - Generates and combines simple rules
- Features are terms with tf as weights
- Used most likely single class
  - Explored distribution of all classes
  - Unable to do so successfully
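The Naïve Bayes option with term-frequency features and a single most likely class could look roughly like the sketch below. It is a textbook multinomial Naïve Bayes with add-one smoothing, which is an assumption; the slides do not say which variant or smoothing was used. The function names and the toy training stories are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(tagged_stories):
    """tagged_stories: iterable of (roi_label, tokens). Returns the counts NB needs."""
    class_docs = Counter()              # number of training stories per ROI
    term_counts = defaultdict(Counter)  # term frequencies per ROI
    vocab = set()
    for roi, tokens in tagged_stories:
        class_docs[roi] += 1
        term_counts[roi].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def classify(tokens, model):
    """Return the single most likely ROI for a tokenized story."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_roi, best_logp = None, -math.inf
    for roi in class_docs:
        total = sum(term_counts[roi].values())
        logp = math.log(class_docs[roi] / n_docs)  # class prior
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                # add-one smoothing (an assumption, not stated in the slides)
                logp += math.log((term_counts[roi][t] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_roi, best_logp = roi, logp
    return best_roi

# Toy training data; the ROI labels are taken from the slides, the tokens are made up.
model = train_nb([
    ("Legal/Criminal Cases", ["court", "trial", "judge"]),
    ("Political/Diplomatic Mtgs.", ["summit", "visit", "talks"]),
])
print(classify(["trial", "judge", "visit"], model))
```

Repeating a token in the input story multiplies in its probability once per occurrence, which is how raw tf acts as the feature weight here.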
7
Training Data for Classification
- Experiments: train on TDT-2, test on TDT-3
- Submissions: train on TDT-2 plus TDT-3
- Training data prepared the same way
  - Stories in each topic tagged with topic’s ROI
  - Remove duplicate stories (in topics with the same ROI)
  - Remove all stories with more than one ROI
    - Worst case: a single story relevant to…
      - Chinese Labor Activists, with ROI Legal/Criminal Cases
      - Blair Visits China in October, with ROI Political/Diplomatic Mtgs.
      - China will not allow Opposition Parties, with ROI Miscellaneous
- Experiments with removing named entities for training
8
Naïve Bayes vs. BoosTexter
- Similar classification accuracy
  - Overall accuracy is the same
  - Errors are substantially different
- Our training results (TDT-3)
  - BoosTexter beat Naïve Bayes for SLD and NED
  - BoosTexter used in most tasks for submission
- Evaluation results
  - In Link Detection, Naïve Bayes was more useful
9
ROI classes in link detection
- Given a story pair and their estimated ROIs
  - If estimated ROIs are the same, leave score alone
  - If they are different, reduce score
    - Reduced to 1/3 of original value, based on training runs
- Used four different ROI classifiers
  - ROI-BT,ne: BoosTexter with named entities
  - ROI-BT,no-ne: BoosTexter without named entities
  - ROI-NB,ne: Naïve Bayes with named entities
  - ROI-NB,no-ne: Naïve Bayes without named entities
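The score adjustment on this slide is simple enough to write down directly. The following is a minimal sketch; the function name and the example scores are hypothetical, and only the 1/3 factor comes from the slide.

```python
def adjust_link_score(score, roi_a, roi_b, penalty=1.0 / 3.0):
    """Leave the similarity alone when the estimated ROIs agree,
    otherwise scale it down to `penalty` times its original value."""
    if roi_a == roi_b:
        return score
    return score * penalty

# Example: the same raw similarity under matching and mismatching ROI estimates.
print(adjust_link_score(0.6, "Legal/Criminal Cases", "Legal/Criminal Cases"))
print(adjust_link_score(0.6, "Legal/Criminal Cases", "Miscellaneous"))
```

Because the penalty is multiplicative, the ranking of pairs within the "same ROI" group and within the "different ROI" group is unchanged; only the relative standing of the two groups shifts.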
10
Training effectiveness (TDT-3)
Story Link Detection, minimum normalized cost
Various types of databases

               1Dcos    4Dcos    UDcos
original       0.3536   0.2556   0.3254
ROI-BT,ne      0.2959   0.2360   0.2748
ROI-BT,no ne   0.4600   0.3670   0.4246
ROI-NB,ne      0.3724   0.3047   0.3380
ROI-NB,no ne   0.4072   0.3269   0.3718
11
Evaluation results
Story link detection
Various types of databases

               1Dcos    4Dcos    UDcos
original       0.2472   0.1983   0.2439
ROI-BT,ne      0.3090   0.2587   0.2938
ROI-BT,no ne   0.3220   0.2649   0.3020
ROI-NB,ne      0.2867   0.2407   0.2697
ROI-NB,no ne   0.2937   0.2463   0.2738
12
ROI for tracking
- Compare story to centroid of topic
  - Built from training stories
- If ROI does not match, drop score based on how bad the mismatch is
- Used ROI-BT,ne classifier only
13
Training for tracking
Topic tracking on TDT-3, minimum normalized cost
ROI: BoosTexter with named entities only
Various types of databases

                    1Dcos    4Dcos    ADcos    UDcos
Nt=1   orig         0.1890   0.1819   0.1390   0.1819
       ROI-BT,ne    0.1659   0.1489   0.1280   0.1541
Nt=4   orig         0.1427   0.1294   0.1076   0.1321
       ROI-BT,ne    0.1639   0.1314   0.1078   0.1494
14
Evaluation results
Topic tracking on TDT-3, minimum normalized cost
ROI: BoosTexter with named entities only
Various types of databases

                    1Dcos    4Dcos    ADcos    UDcos
Nt=1   orig         0.1968   0.2149   0.2270   0.2604
       ROI-BT,ne    0.3965   0.3807   0.3572   0.5002
Nt=4   orig         0.1716   0.1610   0.1463   0.1988
       ROI-BT,ne    0.2996   0.2682   0.2525   0.3677
15
ROI-based vocabulary pruning
- New Event Detection only
- Create “stop list” for each ROI
  - 300 most frequent terms in stories within ROI
  - Obtained from TDT-2 corpus
- When story is classified into an ROI…
  - Remove those terms from the story’s vector
  - ROI determined from BoosTexter classifier
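A rough sketch of the per-ROI stop lists follows. It assumes "most frequent" means total term count across the ROI's training stories; the slide does not say whether term frequency or document frequency was used. The helper names and the toy data are hypothetical.

```python
from collections import Counter

def build_roi_stop_lists(tagged_stories, size=300):
    """tagged_stories: iterable of (roi_label, tokens).
    Returns {roi_label: set of the `size` most frequent terms in that ROI}."""
    counts = {}
    for roi, tokens in tagged_stories:
        counts.setdefault(roi, Counter()).update(tokens)
    return {roi: {term for term, _ in c.most_common(size)}
            for roi, c in counts.items()}

def prune_vector(vector, roi, stop_lists):
    """Drop the ROI's stop-list terms from a story's {term: weight} vector."""
    stop = stop_lists.get(roi, set())
    return {term: w for term, w in vector.items() if term not in stop}

# Toy example with a stop list of size 2 instead of 300.
training = [
    ("Legal/Criminal Cases", ["court", "trial", "court", "judge"]),
    ("Legal/Criminal Cases", ["trial", "appeal", "court"]),
]
stop_lists = build_roi_stop_lists(training, size=2)
story = {"court": 2.0, "boston": 1.5, "trial": 0.5}
print(prune_vector(story, "Legal/Criminal Cases", stop_lists))
```

The intent is that terms common to every story in a category (the "health care" vocabulary from the motivation slide) stop dominating the similarity, leaving the terms that distinguish one event from another.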
16
New Event Detection approach
- Cosine similarity measure
- ROI-based vocabulary pruning
- Score normalization
- Incremental IDF
- Remove short documents
- Preprocessing
  - Train BoosTexter on TDT-2 & TDT-3
  - Include named entities while training
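To make the combination of cosine similarity, incremental IDF, and short-document removal concrete, here is a minimal sketch. It assumes the usual new-event-detection decision rule (a story is novel when its best match among earlier stories is weak), which the slide does not spell out; the IDF formula, the `min_length` cutoff, and all names are assumptions. Score normalization and ROI pruning are left out.

```python
import math
from collections import Counter

def tfidf(counts, df, n_docs):
    """Weight raw term counts by an IDF computed from the documents seen so far."""
    return {t: c * math.log((n_docs + 1) / df[t]) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def novelty_scores(stories, min_length=2):
    """For each tokenized story in arrival order, return 1 - (best cosine
    against earlier stories), or None for stories shorter than min_length."""
    df = Counter()  # document frequencies, updated incrementally
    seen = []       # raw term counts of earlier stories
    scores = []
    for tokens in stories:
        if len(tokens) < min_length:
            scores.append(None)  # short documents are skipped
            continue
        counts = Counter(tokens)
        df.update(counts.keys())
        n_docs = len(seen) + 1
        current = tfidf(counts, df, n_docs)
        best = max((cosine(current, tfidf(old, df, n_docs)) for old in seen),
                   default=0.0)
        scores.append(1.0 - best)
        seen.append(counts)
    return scores

stream = [["quake", "turkey", "rescue"], ["quake", "turkey", "aid"],
          ["election", "vote", "poll"], ["x"]]
print(novelty_scores(stream))
```

Earlier stories are stored as raw counts and re-weighted at comparison time, so both sides of each cosine use the same, current IDF values.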
17
NED Results
[Figures: results on TDT-3 and TDT-4]
18
ROI Conclusions
- Both uses of ROI helped in training
  - Score reduction for ROI mismatch
    - Tracking and link detection
  - Vocabulary pruning for new event detection
- Score reduction failed in evaluation
  - Named entities important in ROI classifier
  - TDT-4 has a different set of entities (time gap)
  - Possible overfitting to TDT-3?
- Preliminary work applying to detection
  - Unsuccessful to date
19
Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models
20
Comparing multilingual stories
- Baseline
  - All stories converted to English
  - Using provided machine translations
- New approaches
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptation in tracking
21
Dictionary Translation of Arabic
- Probabilistic translation model
  - Each Arabic word has multiple English translations
  - Obtain P(e|a) from UN Arabic-English parallel corpus
- Forms a pseudo-story in English representing the Arabic story
  - Can get large due to multiple translations per word
  - Keep English words whose summed probabilities are the greatest
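The pseudo-story construction can be sketched as below. The translation table is assumed to already hold P(e|a) values; the cutoff `keep` is a hypothetical parameter, since the slide does not say how many English words were retained. The tokens `a1` and `a2` are placeholders for Arabic words, not real data.

```python
from collections import defaultdict

def translate_story(arabic_tokens, translation_table, keep=5):
    """Build an English pseudo-story from an Arabic token list.

    translation_table: {arabic_word: {english_word: P(e|a)}}
    Returns the `keep` English words with the largest summed probability,
    as an ordered {english_word: summed_probability} dict.
    """
    mass = defaultdict(float)
    for a in arabic_tokens:
        for e, p in translation_table.get(a, {}).items():
            mass[e] += p  # sum P(e|a) over every occurrence of a
    ranked = sorted(mass.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])

table = {
    "a1": {"peace": 0.7, "truce": 0.3},
    "a2": {"talks": 0.6, "peace": 0.2, "speech": 0.2},
}
print(translate_story(["a1", "a2", "a2"], table, keep=2))
```

Summing over occurrences means an English word supported by several Arabic words, or by a frequent one, outranks a high-probability translation of a word that appears once.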
22
Language specific comparisons
- Language representations:
  - Arabic: CP1256 encoding and light stemming
  - English: stopped and stemmed with kstem
  - Chinese: segmented if necessary, and overlapping bigrams
- Linking task:
  - If stories are in the same language, use that language
  - All other comparisons done using all stories translated into English
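Two pieces of this slide lend themselves to a short illustration: overlapping bigrams and the rule for choosing the comparison language. The sketch assumes the bigrams are taken over characters, which the slide does not state explicitly, and both function names are hypothetical.

```python
def overlapping_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

def comparison_language(lang_a, lang_b):
    """Compare natively when both stories share a language, else in English."""
    return lang_a if lang_a == lang_b else "english"

print(overlapping_bigrams("北京大学"))
print(comparison_language("arabic", "arabic"))
print(comparison_language("arabic", "mandarin"))
```

Overlapping bigrams sidestep word segmentation errors: every adjacent character pair becomes an index term, so two stories can match on a word even when a segmenter would split it differently.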
23
Adaptation in tracking
- Adaptation
  - Stories added to topic when similarity score is high
- Establish a topic representation in each language as soon as an added story in that language appears
- Similarity of an Arabic story is computed against the Arabic topic representation, etc.
24
Cross-Lingual Link Detection Results

Translation   Minimum cost     Minimum cost     Cost
condition     (TDT-3)          (TDT-4)
1DcosIDF      0.3536           0.2472           0.2523
UDcosIDF      0.3254 (-8%)     0.2439 (-1%)     0.2597
4DcosIDF      0.2556 (-28%)    0.1983 (-20%)    0.2000

Translation conditions:
- 1DcosIDF: baseline, all stories in English using provided translations.
- UDcosIDF: all stories in English, but using dictionary translation of Arabic.
- 4DcosIDF: compare a pair of stories in their native language if both are in the same language; otherwise compare them in English using the dictionary translation of Arabic.
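The percentages in parentheses appear to be changes relative to the 1DcosIDF baseline in the same column. The short check below reproduces them under that reading; the function name is hypothetical.

```python
def relative_change(baseline, value):
    """Percent change of `value` relative to `baseline` (negative means lower cost)."""
    return 100.0 * (value - baseline) / baseline

# Minimum costs on TDT-3 from the table.
tdt3 = {"1DcosIDF": 0.3536, "UDcosIDF": 0.3254, "4DcosIDF": 0.2556}
for name, cost in tdt3.items():
    change = relative_change(tdt3["1DcosIDF"], cost)
    print(f"{name}: {cost:.4f} ({change:+.0f}%)")
```

Since lower cost is better in TDT evaluation, a negative percentage is an improvement over the baseline.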
25
Cross-Lingual Topic Tracking Results (required condition: Nt=1, bnman)

Translation   Minimum cost     Minimum cost     Cost
condition     (TDT-3)          (TDT-4)
1DcosIDF      0.1890           0.1968           0.1964
UDcosIDF      0.1853 (-2%)     0.2024 (+3%)     0.2604
4DcosIDF      0.1819 (-4%)     0.2036 (+3%)     0.2149
ADcosIDF      0.1390 (-26%)    0.2007 (+2%)     0.2270

Translation conditions:
- 1DcosIDF: baseline.
- UDcosIDF: dictionary translation of Arabic.
- 4DcosIDF: comparing a pair of stories in native language.
- ADcosIDF: baseline plus adaptation. A story is added to the centroid vector if its similarity score exceeds the adapting threshold; the vector is limited to its top 100 terms, and at most 100 stories can be added to the centroid.
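The adaptation rule for ADcosIDF can be sketched as a small class. The limits of 100 terms and 100 added stories come from the slide; rebuilding the centroid as a plain average of member vectors, and the class and method names, are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class AdaptiveTopic:
    """Topic centroid that absorbs high-scoring stories as they arrive."""

    def __init__(self, seed_vectors, threshold, max_terms=100, max_added=100):
        self.threshold = threshold
        self.max_terms = max_terms
        self.max_added = max_added
        self.added = 0
        self.members = [dict(v) for v in seed_vectors]
        self.centroid = self._build_centroid()

    def _build_centroid(self):
        total = {}
        for vec in self.members:
            for term, w in vec.items():
                total[term] = total.get(term, 0.0) + w
        # keep only the highest-weighted terms
        top = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[: self.max_terms]
        return {term: w / len(self.members) for term, w in top}

    def score(self, vector):
        """Return the similarity to the centroid, adapting if it is high enough."""
        s = cosine(vector, self.centroid)
        if s > self.threshold and self.added < self.max_added:
            self.members.append(dict(vector))
            self.added += 1
            self.centroid = self._build_centroid()
        return s

topic = AdaptiveTopic([{"quake": 1.0, "turkey": 1.0}], threshold=0.4)
print(topic.score({"quake": 1.0, "rescue": 1.0}), sorted(topic.centroid))
```

The story is scored against the centroid as it stood before the update, so a story never contributes to its own score.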
26
Cross-Lingual Topic Tracking Results (alternate condition: Nt=4, bnasr)

Translation   Minimum cost     Minimum cost     Cost
condition     (TDT-3)          (TDT-4)
1DcosIDF      0.1427           0.1676           0.1716
UDcosIDF      0.1321 (-7%)     0.1594 (-5%)     0.1988
4DcosIDF      0.1294 (-9%)     0.1501 (-10%)    0.1610
ADcosIDF      0.1076 (-25%)    0.1443 (-14%)    0.1463

Translation conditions:
- 1DcosIDF: baseline.
- UDcosIDF: dictionary translation of Arabic.
- 4DcosIDF: comparing a pair of stories in native language.
- ADcosIDF: baseline plus adaptation.
27
Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models
28
Relevance Models for SLD
- Relevance Model (RM): “model of stories relevant to a query”
- Algorithm: given stories A, B
  1. compute “queries” Q_A and Q_B
  2. estimate relevance models P(w|Q_A) and P(w|Q_B)
  3. compute divergence between relevance models
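The three steps can be sketched on a toy collection as follows. This is an illustration under several assumptions that the slide does not confirm: each story's full token list is used as its query, document models are smoothed by linear interpolation with a background model, the relevance model is the posterior-weighted mixture of document models, and the divergence is symmetric KL. All function names and the smoothing weight `lam` are hypothetical.

```python
import math
from collections import Counter

def background_model(collection):
    counts = Counter(t for doc in collection for t in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def doc_model(tokens, background, lam=0.5):
    """P(w|M): document frequencies smoothed with the background model."""
    tf = Counter(tokens)
    n = len(tokens)
    return {w: lam * tf[w] / n + (1 - lam) * p for w, p in background.items()}

def relevance_model(query, doc_models):
    """P(w|Q) = sum over M of P(w|M) P(M|Q), with P(M|Q) proportional to P(Q|M)."""
    log_like = [sum(math.log(m[q]) for q in query if q in m) for m in doc_models]
    top = max(log_like)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in log_like]
    z = sum(weights)
    return {w: sum(wt / z * m[w] for wt, m in zip(weights, doc_models))
            for w in doc_models[0]}

def symmetric_kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) + q[w] * math.log(q[w] / p[w])
               for w in p)

def link_divergence(story_a, story_b, collection):
    bg = background_model(collection)
    models = [doc_model(doc, bg) for doc in collection]
    return symmetric_kl(relevance_model(story_a, models),
                        relevance_model(story_b, models))

docs = [["quake", "turkey", "rescue", "quake"], ["quake", "turkey", "aid", "rescue"],
        ["election", "vote", "poll", "vote"], ["election", "poll", "candidate", "vote"]]
print(link_divergence(docs[0], docs[1], docs), link_divergence(docs[0], docs[2], docs))
```

A low divergence suggests the two stories are linked. Because each relevance model pools evidence from all documents that resemble the story, two stories can come out similar even when they share few terms directly.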
29
Results: Story Link Detection

                    TDT-3    TDT-4
Cosine / tf.idf     .2551    .1983
Relevance Model     .1938    .1881
Rel. Model + ROI    .1862    .1863
30
Relevance Models for Tracking
1. Initialize:
   - set P(M|Q) = 1/Nt if M is a training doc
   - compute relevance model as before
2. For each incoming story D:
   - score = divergence between P(w|D) and RM
   - if (score > threshold)
     - add D to the training, recompute RM
     - allow no more than k adaptations
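A minimal sketch of this loop is given below. It makes several assumptions the slide leaves open: the score is the negative KL divergence from the relevance model to the story model (so that a higher score means a closer match, consistent with the "score > threshold" test), added stories receive the same uniform weight as training documents, and the background model is supplied up front and covers the whole vocabulary. Names and parameter values are hypothetical.

```python
import math
from collections import Counter

def doc_model(tokens, background, lam=0.5):
    """P(w|D): document frequencies smoothed with the background model."""
    tf = Counter(tokens)
    n = len(tokens)
    return {w: lam * tf[w] / n + (1 - lam) * p for w, p in background.items()}

def mixture(models):
    """Relevance model with uniform P(M|Q) = 1/len(models)."""
    return {w: sum(m[w] for m in models) / len(models) for w in models[0]}

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def track(training, stream, background, threshold, k=2):
    """Score each incoming story; adapt the relevance model at most k times."""
    members = [doc_model(doc, background) for doc in training]
    rm = mixture(members)
    adaptations = 0
    scores = []
    for tokens in stream:
        d = doc_model(tokens, background)
        score = -kl(rm, d)  # assumption: higher score means closer to the topic
        scores.append(score)
        if score > threshold and adaptations < k:
            members.append(d)
            adaptations += 1
            rm = mixture(members)
    return scores, adaptations

training = [["quake", "turkey", "rescue"]]
stream = [["quake", "rescue", "aid"], ["election", "vote", "poll"]]
counts = Counter(t for doc in training + stream for t in doc)
background = {w: c / sum(counts.values()) for w, c in counts.items()}
print(track(training, stream, background, threshold=-0.4))
```

Capping the number of adaptations at k limits topic drift: a few early false alarms cannot pull the model arbitrarily far from the training stories.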
31
Results: Topic Tracking

                    TDT-3    TDT-4
Cosine / tf.idf     .1888    .1964
Language Model      .1481    .2122
Adaptive tf.idf     .1390    .2007
Relevance Model     .0953    .1784
32
Conclusions
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models