UMass Amherst at TDT 2003
James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan
Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts Amherst
What we did
Tasks:
- Story Link Detection
- Topic Tracking
- New Event Detection
- Cluster Detection
Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models
ROI motivation
- Analyzed vector space similarity measures; they failed to distinguish between similar topics
  - e.g., two "health care" stories from different topics involve different locations and individuals, but their similarity is dominated by shared "health care" terms: drugs, cost, coverage, plan, prescription
- Possible solution: first categorize stories
  - Different categories usually mean different topics (mostly true)
  - Use within-category statistics, so "health care" terms may be less confusing
- Rules of Interpretation provide natural categories
ROI intuition
- Each document in the corpus is classified into one of the ROI categories
- Stories in different ROIs are less likely to be in the same topic, so if two stories belong to different ROIs, we should trust their similarity less:
  - Same ROI: sim_new(s1, s2) = sim_old(s1, s2)
  - Different ROIs: sim_new(s1, s2) < sim_old(s1, s2)
ROI classifiers
- Naïve Bayes
- BoosTexter [Schapire and Singer, 2000]
  - Decision tree classifier that generates and combines simple rules
  - Features are terms with tf as weights
- Used the most likely single class
  - Explored using the distribution over all classes, but were unable to do so successfully
Training Data for Classification
- Experiments: train on TDT-2, test on TDT-3
- Submissions: train on TDT-2 plus TDT-3
- Training data prepared the same way:
  - Stories in each topic tagged with the topic's ROI
  - Remove duplicate stories (in topics with the same ROI)
  - Remove all stories with more than one ROI
    - Worst case: a single story relevant to "Chinese Labor Activists" (ROI Legal/Criminal Cases), "Blair Visits China in October" (ROI Political/Diplomatic Mtgs.), and "China will not allow Opposition Parties" (ROI Miscellaneous)
- Experimented with removing named entities for training
Naïve Bayes vs. BoosTexter
- Similar classification accuracy: overall accuracy is the same, but the errors are substantially different
- Our training results (TDT-3): BoosTexter beat Naïve Bayes for SLD and NED, so BoosTexter was used in most tasks for the submission
- Evaluation results: in Link Detection, Naïve Bayes turned out to be more useful
ROI classes in link detection
- Given a story pair and their estimated ROIs:
  - If the estimated ROIs are the same, leave the score alone
  - If they are different, reduce the score to 1/3 of its original value, based on training runs (see the sketch below)
- Used four different ROI classifiers:
  - ROI-BT, ne: BoosTexter with named entities
  - ROI-BT, no-ne: BoosTexter without named entities
  - ROI-NB, ne: Naïve Bayes with named entities
  - ROI-NB, no-ne: Naïve Bayes without named entities
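A minimal sketch of the score adjustment above, assuming a generic pairwise similarity function and a classifier object with a `predict` method; those interfaces and names are illustrative, while the 1/3 penalty factor comes from the slides.

```python
ROI_PENALTY = 1.0 / 3.0  # tuned on training runs, per the slides

def roi_adjusted_similarity(story1, story2, similarity, roi_classifier):
    """Reduce a pairwise similarity score when the two stories fall
    into different Rule of Interpretation (ROI) categories."""
    roi1 = roi_classifier.predict(story1)  # most likely single class
    roi2 = roi_classifier.predict(story2)
    score = similarity(story1, story2)
    if roi1 == roi2:
        return score            # same ROI: leave the score alone
    return score * ROI_PENALTY  # different ROIs: trust similarity less
```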
Training effectiveness (TDT-3)
[Chart: story link detection, minimum normalized cost for various types of databases (1Dcos, 4Dcos, UDcos), comparing the original scores against ROI-BT,ne; ROI-BT,no-ne; ROI-NB,ne; and ROI-NB,no-ne]
Evaluation results
[Chart: story link detection, minimum normalized cost for various types of databases (1Dcos, 4Dcos, UDcos), same systems as above]
ROI for tracking
- Compare each story to the centroid of the topic, built from the training stories
- If the ROI does not match, drop the score based on how bad the mismatch is (see the sketch below)
- Used the ROI-BT,ne classifier only
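A minimal sketch of ROI-penalized tracking against a topic centroid. The slides only say the penalty depends on "how bad the mismatch is"; grading it with a caller-supplied weight is an assumption for illustration, not the reported method.

```python
import math
from collections import Counter

def cosine(v1: Counter, v2: Counter) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def track_score(story_vec, centroid_vec, story_roi, topic_roi, mismatch_weight):
    score = cosine(story_vec, centroid_vec)
    if story_roi != topic_roi:
        # mismatch_weight in (0, 1]: larger for worse mismatches (assumed)
        score *= 1.0 - mismatch_weight
    return score
```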
Training for tracking
[Chart: topic tracking on TDT-3, minimum normalized cost for various types of databases (1Dcos, 4Dcos, ADcos, UDcos), original vs. ROI-BT,ne at Nt=1 and Nt=4; ROI is BoosTexter with named entities only]
Evaluation results
[Chart: topic tracking, minimum normalized cost for various types of databases (1Dcos, 4Dcos, ADcos, UDcos), original vs. ROI-BT,ne at Nt=1 and Nt=4; ROI is BoosTexter with named entities only]
ROI-based vocabulary pruning
- New Event Detection only
- Create a "stop list" for each ROI: the 300 most frequent terms in stories within that ROI, obtained from the TDT-2 corpus
- When a story is classified into an ROI, remove those terms from the story's vector (see the sketch below)
- ROI determined by the BoosTexter classifier
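A minimal sketch of the per-ROI pruning, assuming per-category term counts from TDT-2 are available as `Counter`s; the names and data structures are illustrative.

```python
from collections import Counter

STOPLIST_SIZE = 300  # per the slides

def build_roi_stoplists(roi_term_counts):
    """roi_term_counts: dict mapping ROI -> Counter of term frequencies
    over the TDT-2 stories in that ROI."""
    return {roi: {t for t, _ in counts.most_common(STOPLIST_SIZE)}
            for roi, counts in roi_term_counts.items()}

def prune_vector(story_vec, roi, stoplists):
    """Drop the ROI's high-frequency terms from a story's term vector."""
    stop = stoplists.get(roi, set())
    return {t: w for t, w in story_vec.items() if t not in stop}
```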
New Event Detection approach
- Cosine similarity measure
- ROI-based vocabulary pruning
- Score normalization
- Incremental IDF (see the sketch below)
- Remove short documents
- Preprocessing
- Train BoosTexter on TDT-2 & TDT-3, including named entities while training
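A minimal sketch of incremental IDF for the online setting: document frequencies are updated as each story arrives, so IDF values reflect only the stream seen so far. The smoothing constants are assumptions, not the submitted system's exact formula.

```python
import math
from collections import Counter

class IncrementalIDF:
    def __init__(self):
        self.doc_freq = Counter()  # term -> number of stories containing it
        self.num_docs = 0

    def add(self, story_terms):
        """Update document frequencies with a newly arrived story."""
        self.num_docs += 1
        self.doc_freq.update(set(story_terms))

    def idf(self, term):
        # +1 / +0.5 smoothing keeps unseen terms finite (assumed)
        return math.log((self.num_docs + 1) / (self.doc_freq[term] + 0.5))
```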
NED Results
[Chart: new event detection results on TDT-3 and TDT-4]
ROI Conclusions
- Both uses of ROI helped in training:
  - Score reduction for ROI mismatch (tracking and link detection)
  - Vocabulary pruning for new event detection
- Score reduction failed in evaluation
  - Named entities are important in the ROI classifier, and TDT-4 has a different set of entities (time gap)
  - Possible overfitting to TDT-3?
- Preliminary work on applying ROI to detection has been unsuccessful to date
Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models
Comparing multilingual stories
- Baseline: all stories converted to English using the provided machine translations
- New approaches:
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptation in tracking
Dictionary Translation of Arabic
- Probabilistic translation model: each Arabic word has multiple English translations
- Obtain P(e|a) from the UN Arabic-English parallel corpus
- Forms an English pseudo-story representing the Arabic story (see the sketch below)
  - Can get large due to multiple translations per word
  - Keep the English words whose summed probabilities are the greatest
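A minimal sketch of building the pseudo-story from a probabilistic dictionary P(e|a). The data structures and the size cutoff are illustrative assumptions.

```python
from collections import Counter

def pseudo_story(arabic_terms, p_e_given_a, max_terms=100):
    """arabic_terms: Counter of Arabic word frequencies in the story.
    p_e_given_a: dict mapping an Arabic word -> {english_word: P(e|a)}."""
    english = Counter()
    for a, freq in arabic_terms.items():
        for e, p in p_e_given_a.get(a, {}).items():
            english[e] += freq * p  # accumulate translation probabilities
    # keep the English words whose summed probabilities are greatest
    return dict(english.most_common(max_terms))
```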
Language-specific comparisons
- Language representations:
  - Arabic: CP1256 encoding and light stemming
  - English: stopped and stemmed with kstem
  - Chinese: segmented if necessary, then overlapping bigrams (see the sketch below)
- Linking task: if both stories are in the same language, compare them in that language
- All other comparisons done with all stories translated into English
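A minimal sketch of the overlapping-bigram representation for Chinese text: each pair of adjacent characters becomes one indexing unit.

```python
def overlapping_bigrams(text: str) -> list[str]:
    """'ABCD' -> ['AB', 'BC', 'CD']; shorter inputs pass through."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]
```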
Adaptation in tracking
- Stories are added to the topic when their similarity score is high
- Establish a topic representation in each language as soon as an added story in that language appears
- An Arabic story is then compared to the Arabic topic representation, and so on (see the sketch below)
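A minimal sketch of per-language adaptive tracking: a separate topic centroid is kept for each language, created the first time an adapted story in that language appears. The top-100-terms and 100-story limits follow the ADcosIDF description later in the slides; the interfaces and the fallback behavior are assumptions.

```python
from collections import Counter

class PerLanguageTopic:
    def __init__(self, adapt_threshold, max_adapted=100, max_terms=100):
        self.centroids = {}  # language -> centroid Counter
        self.adapt_threshold = adapt_threshold
        self.max_adapted = max_adapted
        self.max_terms = max_terms
        self.num_adapted = 0

    def score(self, story_vec, language, similarity):
        centroid = self.centroids.get(language)
        if centroid is None:
            return None  # caller falls back to the English comparison (assumed)
        return similarity(story_vec, centroid)

    def maybe_adapt(self, story_vec, language, score):
        """Add a high-scoring story to its language's centroid."""
        if score is None or score <= self.adapt_threshold:
            return
        if self.num_adapted >= self.max_adapted:
            return
        centroid = self.centroids.setdefault(language, Counter())
        centroid.update(story_vec)
        # keep only the top-weighted terms in the centroid
        self.centroids[language] = Counter(dict(centroid.most_common(self.max_terms)))
        self.num_adapted += 1
```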
Cross-Lingual Link Detection Results

  Translation condition   TDT-3 min. cost   TDT-4 min. cost
  1DcosIDF                baseline          baseline
  UDcosIDF                -8%               -1%
  4DcosIDF                -28%              -20%

Translation conditions:
- 1DcosIDF: baseline; all stories in English using the provided translations
- UDcosIDF: all stories in English, but using the dictionary translation of Arabic
- 4DcosIDF: compare a story pair in the native language if both stories are in the same language; otherwise compare in English using the dictionary translation of Arabic
Cross-Lingual Topic Tracking Results (required condition: Nt=1, bnman)

  Translation condition   TDT-3 min. cost   TDT-4 min. cost
  1DcosIDF                baseline          baseline
  UDcosIDF                -2%               +3%
  4DcosIDF                -4%               +3%
  ADcosIDF                -26%              +2%

Translation conditions:
- 1DcosIDF: baseline
- UDcosIDF: dictionary translation of Arabic
- 4DcosIDF: comparing a story pair in the native language
- ADcosIDF: baseline plus adaptation; a story is added to the centroid vector if its similarity score exceeds the adaptation threshold, the vector is limited to the top 100 terms, and at most 100 stories can be added to the centroid
Cross-Lingual Topic Tracking Results (alternate condition: Nt=4, bnasr)

  Translation condition   TDT-3 min. cost   TDT-4 min. cost
  1DcosIDF                baseline          baseline
  UDcosIDF                -7%               -5%
  4DcosIDF                -9%               -10%
  ADcosIDF                -25%              -14% (0.1463)

Translation conditions:
- 1DcosIDF: baseline
- UDcosIDF: dictionary translation of Arabic
- 4DcosIDF: comparing a story pair in the native language
- ADcosIDF: baseline plus adaptation
Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models
Relevance Models for SLD
- Relevance model (RM): a "model of stories relevant to a query"
- Algorithm, given stories A and B (see the sketch below):
  1. Compute "queries" Q_A and Q_B
  2. Estimate relevance models P(w|Q_A) and P(w|Q_B)
  3. Compute the divergence between the relevance models
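A minimal sketch of steps 2 and 3: estimate a relevance model for each story's query over a background collection, then compare the two models with a symmetrized KL divergence. The estimation details and the choice of divergence are simplified assumptions, not the exact submitted system.

```python
import math

def relevance_model(query_terms, collection, doc_model):
    """Estimate P(w|Q) ~ sum_D P(w|D) P(D|Q), with P(D|Q) taken
    proportional to the query likelihood of D (an assumption).
    collection: list of documents (term sequences); doc_model(doc, w)
    returns a smoothed P(w|D) for any word w."""
    weights = [math.exp(sum(math.log(doc_model(doc, q)) for q in query_terms))
               for doc in collection]
    z = sum(weights) or 1.0
    vocab = {w for doc in collection for w in doc}
    return {w: sum(doc_model(doc, w) * wt / z
                   for doc, wt in zip(collection, weights))
            for w in vocab}

def sym_kl(p, q, eps=1e-10):
    """Symmetrized KL divergence between two word distributions:
    KL(p||q) + KL(q||p) = sum (p - q) * log(p / q)."""
    vocab = set(p) | set(q)
    return sum((p.get(w, eps) - q.get(w, eps)) *
               math.log(p.get(w, eps) / q.get(w, eps)) for w in vocab)
```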
Results: Story Link Detection
[Chart: TDT-3 and TDT-4 results for cosine/tf.idf, relevance model, and relevance model + ROI]
Relevance Models for Tracking
1. Initialize: set P(M|Q) = 1/Nt if M is a training doc; compute the relevance model as before
2. For each incoming story D:
   - score = divergence between P(w|D) and the RM
   - if score > threshold, add D to the training set and recompute the RM
   - allow no more than k adaptations (see the sketch below)
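A minimal sketch of the adaptive loop above. `estimate_rm` and `similarity` stand in for the relevance-model estimation and the divergence-based scoring; their interfaces, and treating higher scores as closer to the topic, are assumptions.

```python
def track_topic(training_docs, stream, estimate_rm, similarity,
                threshold, max_adaptations):
    """Adaptive relevance-model tracking over an incoming story stream."""
    rm = estimate_rm(training_docs)  # initial RM from the Nt training docs
    adaptations = 0
    decisions = []
    for story in stream:
        score = similarity(story, rm)  # higher = closer to the topic (assumed)
        on_topic = score > threshold
        decisions.append((story, score, on_topic))
        if on_topic and adaptations < max_adaptations:
            training_docs.append(story)      # add D to the training set
            rm = estimate_rm(training_docs)  # recompute the relevance model
            adaptations += 1
    return decisions
```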
Results: Topic Tracking
[Chart: TDT-3 and TDT-4 results for cosine/tf.idf, language model, adaptive tf.idf, and relevance model]
Conclusions
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models