UMass Amherst at TDT 2003
James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst

What we did
- Tasks
  - Story Link Detection
  - Topic Tracking
  - New Event Detection
  - Cluster Detection

Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models

ROI motivation
- Analyzed vector space similarity measures
  - Failed to distinguish between similar topics
  - e.g. two “health care” stories from different topics
    - different locations and individuals
    - similarity dominated by “health care” terms: drugs, cost, coverage, plan, prescription
- Possible solution: first categorize stories
  - different category → different topics (mostly true)
  - use within-category statistics
    - “health care” may be less confusing
  - Rules of Interpretation provide natural categories

ROI intuition
- Each document in the corpus is classified into one of the ROI categories
- Stories in different ROIs are less likely to be in the same topic
- If two stories belong to different ROIs, we should trust their similarities less
[Diagram: ROI-tagged corpus. Stories in the same ROI: sim_new(s1,s2) = sim_old(s1,s2). Stories in different ROIs: sim_new(s1,s2) < sim_old(s1,s2).]

ROI classifiers
- Naïve Bayes
- BoosTexter [Schapire and Singer, 2000]
  - Decision tree classifier
  - Generates and combines simple rules
  - Features are terms with tf as weights
- Used most likely single class
  - Explored distribution of all classes
  - Unable to do so successfully

Training Data for Classification
- Experiments: train on TDT-2, test on TDT-3
  - Submissions: train on TDT-2 plus TDT-3
- Training data prepared the same way
  - Stories in each topic tagged with topic’s ROI
  - Remove duplicate stories (in topics with the same ROI)
  - Remove all stories with more than one ROI
- Worst case: a single story relevant to…
  - Chinese Labor Activists, with ROI Legal/Criminal Cases
  - Blair Visits China in October, with ROI Political/Diplomatic Mtgs.
  - China will not allow Opposition Parties, with ROI Miscellaneous
- Experiments with removing named entities for training

Naïve Bayes vs. BoosTexter
- Similar classification accuracy
  - Overall accuracy is the same
  - Errors are substantially different
- Our training results (TDT-3)
  - BoosTexter beat Naïve Bayes for SLD and NED
- BoosTexter used in most tasks for submission
- Evaluation results:
  - In Link Detection, Naïve Bayes was more useful

ROI classes in link detection
- Given story pair and their estimated ROIs
- If estimated ROIs are the same, leave score alone
- If they are different, reduce score
  - Reduced to 1/3 of original value based on training runs
- Used four different ROI classifiers
  - ROI-BT,ne: BoosTexter with named entities
  - ROI-BT,no-ne: BoosTexter without named entities
  - ROI-NB,ne: Naïve Bayes with named entities
  - ROI-NB,no-ne: Naïve Bayes without named entities
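The score adjustment described on this slide can be sketched in a few lines. This is a minimal illustration, not the system's code: the function name `adjust_link_score` and the string ROI labels are assumptions, and only the 1/3 factor comes from the slide.

```python
def adjust_link_score(score: float, roi_a: str, roi_b: str,
                      penalty: float = 1.0 / 3.0) -> float:
    """Return the link score after the ROI check.

    Same estimated ROI: the score is left alone.
    Different estimated ROIs: the score is scaled by `penalty`
    (1/3 on the slide, chosen from training runs).
    """
    if roi_a == roi_b:
        return score
    return score * penalty


# Hypothetical story pair with a raw cosine score of 0.6.
same_roi = adjust_link_score(0.6, "Legal/Criminal Cases", "Legal/Criminal Cases")
diff_roi = adjust_link_score(0.6, "Legal/Criminal Cases", "Miscellaneous")
```

Because the penalty is multiplicative, the ranking of pairs within the "different ROI" group is unchanged; only their position relative to same-ROI pairs moves.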

Training effectiveness (TDT-3)
- Story Link Detection
- Minimum normalized cost

                Various types of databases
                1Dcos    4Dcos    UDcos
original        0.3536   0.2556   0.3254
ROI-BT,ne       0.2959   0.2360   0.2748
ROI-BT,no ne    0.4600   0.3670   0.4246
ROI-NB,ne       0.3724   0.3047   0.3380
ROI-NB,no ne    0.4072   0.3269   0.3718

Evaluation results  Story link detection Various types of databases 1Dcos4DcosUDcos original0.24720.19830.2439 ROI-BT,ne0.30900.25870.2938 ROI-BT,no ne0.32200.26490.3020 ROI-NB,ne0.28670.24070.2697 ROI-NB,no ne0.29370.24630.2738

ROI for tracking
- Compare story to centroid of topic
  - Built from training stories
- If ROI does not match, drop score based on how bad mismatch is
- Used ROI-BT,ne classifier only

Training for tracking
- Topic tracking on TDT-3
- Minimum normalized cost
- ROI BoosTexter with named entities only

                    Various types of databases
                    1Dcos    4Dcos    ADcos    UDcos
Nt=1  orig          0.1890   0.1819   0.1390   0.1819
      ROI-BT,ne     0.1659   0.1489   0.1280   0.1541
Nt=4  orig          0.1427   0.1294   0.1076   0.1321
      ROI-BT,ne     0.1639   0.1314   0.1078   0.1494

Evaluation results
- Topic tracking on TDT-3
- Minimum normalized cost
- ROI BoosTexter with named entities only

                    Various types of databases
                    1Dcos    4Dcos    ADcos    UDcos
Nt=1  orig          0.1968   0.2149   0.2270   0.2604
      ROI-BT,ne     0.3965   0.3807   0.3572   0.5002
Nt=4  orig          0.1716   0.1610   0.1463   0.1988
      ROI-BT,ne     0.2996   0.2682   0.2525   0.3677

ROI-based vocabulary pruning
- New Event Detection only
- Create “stop list” for each ROI
  - 300 most frequent terms in stories within ROI
  - Obtained from TDT-2 corpus
- When story is classified into an ROI…
  - Remove those terms from the story’s vector
- ROI determined from BoosTexter classifier
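The pruning steps above might look roughly like the sketch below. The function names and data shapes are hypothetical, and "most frequent" is assumed here to mean raw term counts summed over all stories in the ROI; the slide does not say whether term or document frequency was used.

```python
from collections import Counter


def build_roi_stop_lists(tagged_stories, top_n=300):
    """Build one stop list per ROI from (roi_label, terms) pairs.

    Assumption: frequency is the total term count within the ROI.
    """
    counts = {}
    for roi, terms in tagged_stories:
        counts.setdefault(roi, Counter()).update(terms)
    return {roi: {term for term, _ in c.most_common(top_n)}
            for roi, c in counts.items()}


def prune_vector(vector, roi, stop_lists):
    """Drop the ROI's stop-list terms from a story's term-weight vector."""
    stop = stop_lists.get(roi, set())
    return {term: w for term, w in vector.items() if term not in stop}


# Toy training data; top_n is tiny only so the effect is visible.
training = [
    ("Health", ["drug", "cost", "drug", "plan"]),
    ("Health", ["drug", "cost"]),
    ("Sports", ["game", "score", "game"]),
]
stop_lists = build_roi_stop_lists(training, top_n=2)
pruned = prune_vector({"drug": 2.0, "boston": 1.5}, "Health", stop_lists)
```

After pruning, the story vector keeps only terms that are not generic to its ROI, which is what lets location and person names carry more of the similarity.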

New Event Detection approach
- Cosine similarity measure
  - ROI-based vocabulary pruning
  - Score normalization
  - Incremental IDF
  - Remove short documents
- Preprocessing
  - Train BoosTexter on TDT-2 & TDT-3
  - Include named entities while training
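A rough sketch of cosine-based new event detection with incremental IDF follows. Several pieces are assumptions rather than details from the slides: the smoothed idf formula, scoring novelty as one minus the best similarity to any earlier story, and all names. ROI pruning, score normalization, and short-document removal are left out.

```python
import math
from collections import Counter


class IncrementalIdf:
    """Document frequencies updated as each story arrives."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add(self, terms):
        self.n_docs += 1
        self.df.update(set(terms))

    def idf(self, term):
        # Assumed smoothing; always positive because df <= n_docs.
        return math.log((self.n_docs + 1.0) / (self.df[term] + 0.5))


def tfidf(terms, idf):
    return {t: c * idf.idf(t) for t, c in Counter(terms).items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def novelty_scores(stories):
    """Score each story as 1 - max cosine similarity to earlier stories."""
    idf, seen, scores = IncrementalIdf(), [], []
    for terms in stories:
        idf.add(terms)
        vec = tfidf(terms, idf)
        best = max((cosine(vec, tfidf(old, idf)) for old in seen), default=0.0)
        scores.append(1.0 - best)
        seen.append(terms)
    return scores


scores = novelty_scores([["quake", "turkey"], ["quake", "turkey"],
                         ["election", "vote"]])
```

A story would be flagged as a new event when its novelty score exceeds a tuned threshold; re-weighting old stories with the current idf on every comparison is simple but quadratic, so a real system would cache or index.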

NED Results
[Figure: New Event Detection results, two panels: TDT 3 and TDT 4]

ROI Conclusions
- Both uses of ROI helped in training
  - Score reduction for ROI mismatch
    - Tracking and link detection
  - Vocabulary pruning for new event detection
- Score reduction failed in evaluation
  - Named entities important in ROI classifier
    - TDT-4 has different set of entities (time gap)
  - Possible overfitting to TDT-3?
- Preliminary work applying to detection
  - Unsuccessful to date

Comparing multilingual stories
- Baseline
  - All stories converted to English
  - Using provided machine translations
- New approaches
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptation in tracking

Dictionary Translation of Arabic
- Probabilistic translation model
- Each Arabic word has multiple English translations
  - Obtain P(e|a) from UN Arabic-English parallel corpus
- Forms a pseudo-story in English representing Arabic story
- Can get large due to multiple translations per word
- Keep English words whose summed probabilities are the greatest
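One way to read the pseudo-story construction is sketched below. The nested-dict layout for P(e|a), the cutoff `max_terms`, and the placeholder Arabic tokens are all assumptions for illustration; the slide does not state how many English words were kept.

```python
from collections import defaultdict


def translate_story(arabic_terms, p_e_given_a, max_terms=50):
    """Build an English pseudo-story from an Arabic term list.

    p_e_given_a maps each Arabic word to {english_word: P(e|a)}.
    Each occurrence of an Arabic word adds its translation
    probabilities to the English words; only the `max_terms` words
    with the largest summed probability are kept.
    """
    mass = defaultdict(float)
    for a in arabic_terms:
        for e, p in p_e_given_a.get(a, {}).items():
            mass[e] += p
    ranked = sorted(mass.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_terms])


# Toy translation table with placeholder Arabic tokens a1, a2.
table = {
    "a1": {"peace": 0.7, "calm": 0.3},
    "a2": {"peace": 0.4, "treaty": 0.6},
}
pseudo_story = translate_story(["a1", "a2"], table, max_terms=2)
```

The summed probabilities act as expected term counts, so the pseudo-story can be fed to the same tf.idf weighting as a native English story.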

Language specific comparisons
- Language representations:
  - Arabic: CP1256 encoding and light stemming
  - English: stopped and stemmed with kstem
  - Chinese: segmented if necessary and overlapping bigrams
- Linking task:
  - If stories in same language, use that language
  - All other comparisons done using all stories translated into English
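Two of the pieces on this slide are easy to illustrate: overlapping bigrams for Chinese and the rule that picks the comparison language for a story pair. The sketch assumes character-level bigrams and uses made-up language labels; both helper names are hypothetical.

```python
def overlapping_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace.

    Assumption: bigrams are taken over characters, a common indexing
    unit for Chinese when no word segmentation is trusted.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def comparison_language(lang_a, lang_b):
    """Compare natively when both stories share a language, else in English."""
    return lang_a if lang_a == lang_b else "english"


bigrams = overlapping_bigrams("北京大学")
native = comparison_language("arabic", "arabic")
fallback = comparison_language("arabic", "mandarin")
```

Under this rule an Arabic-Arabic pair never passes through translation at all, which is where the native-language condition can avoid translation noise.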

Adaptation in tracking
- Adaptation
  - Stories added to topic when high similarity score
  - Establish topic representation in each language as soon as added story in that language appears
  - Similarity of Arabic story compared to Arabic topic representation, etc.
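The adaptation loop might be sketched as follows. The limits of 100 terms and 100 added stories follow the ADcosIDF description in this deck; the class name, the story layout (a language tag plus one vector per available representation), and the use of summed rather than averaged centroids are assumptions.

```python
import math


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


class AdaptiveTopic:
    """Topic with one centroid per language, grown by adaptation."""

    def __init__(self, seed_english, adapt_threshold,
                 max_terms=100, max_added=100):
        self.centroids = {"english": dict(seed_english)}
        self.adapt_threshold = adapt_threshold
        self.max_terms = max_terms
        self.max_added = max_added
        self.added = 0

    def _merge(self, lang, vector):
        # Sum weights (cosine ignores scale), then keep the top terms.
        centroid = dict(self.centroids.get(lang, {}))
        for term, w in vector.items():
            centroid[term] = centroid.get(term, 0.0) + w
        top = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
        self.centroids[lang] = dict(top[:self.max_terms])

    def score(self, story):
        lang = story["lang"]
        if lang in self.centroids and lang in story["vectors"]:
            return cosine(story["vectors"][lang], self.centroids[lang])
        return cosine(story["vectors"]["english"], self.centroids["english"])

    def track(self, story):
        s = self.score(story)
        if s > self.adapt_threshold and self.added < self.max_added:
            for lang, vec in story["vectors"].items():
                self._merge(lang, vec)
            self.added += 1
        return s


topic = AdaptiveTopic({"quake": 1.0, "turkey": 1.0}, adapt_threshold=0.5)
first = topic.track({"lang": "arabic", "vectors": {
    "english": {"quake": 1.0, "turkey": 1.0}, "arabic": {"zilzal": 1.0}}})
second = topic.track({"lang": "arabic", "vectors": {
    "english": {"vote": 1.0}, "arabic": {"zilzal": 2.0}}})
```

In the toy run the first Arabic story is matched through its English translation and, once added, establishes an Arabic centroid; the second Arabic story is then scored against that native representation.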

Cross-Lingual Link Detection Results

Translation    Minimum Cost                      Cost
Condition      TDT-3            TDT-4
1DcosIDF       0.3536           0.2472           0.2523
UDcosIDF       0.3254 (-8%)     0.2439 (-1%)     0.2597
4DcosIDF       0.2556 (-28%)    0.1983 (-20%)    0.2000

Translation Conditions:
- 1DcosIDF: baseline, all stories in English using provided translations.
- UDcosIDF: all stories in English but using dictionary translation of Arabic.
- 4DcosIDF: comparing a pair of stories in native language if both stories are in the same language, otherwise comparing them in English using the dictionary translation of Arabic.
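The percentages in the link detection table appear to be relative changes in minimum cost against the 1DcosIDF baseline. A quick check under that assumption (the helper name is hypothetical) reproduces the rounded figures:

```python
def relative_change(baseline, cost):
    """Percentage change of `cost` relative to `baseline`."""
    return 100.0 * (cost - baseline) / baseline


# Minimum costs taken from the link detection table.
tdt3_ud = round(relative_change(0.3536, 0.3254))
tdt3_4d = round(relative_change(0.3536, 0.2556))
tdt4_ud = round(relative_change(0.2472, 0.2439))
tdt4_4d = round(relative_change(0.2472, 0.1983))
```

Negative values are improvements, since a lower normalized detection cost is better.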

Cross-Lingual Topic Tracking Results (required condition: Nt=1, bnman)

Translation    Minimum Cost                      Cost
Condition      TDT-3            TDT-4
1DcosIDF       0.1890           0.1968           0.1964
UDcosIDF       0.1853 (-2%)     0.2024 (+3%)     0.2604
4DcosIDF       0.1819 (-4%)     0.2036 (+3%)     0.2149
ADcosIDF       0.1390 (-26%)    0.2007 (+2%)     0.2270

Translation Conditions:
- 1DcosIDF: baseline.
- UDcosIDF: dictionary translation of Arabic.
- 4DcosIDF: comparing a pair of stories in native language.
- ADcosIDF: baseline plus adaptation. A story is added to the centroid vector if its similarity score exceeds the adapting threshold; the vector is limited to the top 100 terms, and at most 100 stories can be added to the centroid.

Cross-Lingual Topic Tracking Results (alternate condition: Nt=4, bnasr)

Translation    Minimum Cost                      Cost
Condition      TDT-3            TDT-4
1DcosIDF       0.1427           0.1676           0.1716
UDcosIDF       0.1321 (-7%)     0.1594 (-5%)     0.1988
4DcosIDF       0.1294 (-9%)     0.1501 (-10%)    0.1610
ADcosIDF       0.1076 (-25%)    0.1443 (-14%)    0.1463

Translation Conditions:
- 1DcosIDF: baseline.
- UDcosIDF: dictionary translation of Arabic.
- 4DcosIDF: comparing a pair of stories in native language.
- ADcosIDF: baseline plus adaptation.

Relevance Models for SLD
- Relevance Model (RM): “model of stories relevant to a query”
- Algorithm: Given stories A, B
  1. compute “queries” Q_A and Q_B
  2. estimate relevance models P(w|Q_A) and P(w|Q_B)
  3. compute divergence between relevance models
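The three steps can be sketched on a toy collection. Everything beyond the step outline is an assumption: queries are taken as each story's most frequent terms, document models use Jelinek-Mercer smoothing, the relevance model is the mixture P(w|Q) = sum over M of P(w|M) P(M|Q) with P(M|Q) proportional to the query likelihood, and the divergence is symmetric KL. The slides do not specify these choices.

```python
import math
from collections import Counter


def doc_model(terms, background, lam=0.5):
    """Smoothed unigram model P(w|M) over the background vocabulary."""
    tf, n = Counter(terms), len(terms)
    return {w: lam * tf[w] / n + (1 - lam) * p for w, p in background.items()}


def relevance_model(query, models):
    """P(w|Q) = sum_M P(w|M) P(M|Q), P(M|Q) proportional to prod_q P(q|M)."""
    log_like = [sum(math.log(m[q]) for q in query) for m in models]
    top = max(log_like)
    weights = [math.exp(l - top) for l in log_like]
    z = sum(weights)
    return {w: sum(wt / z * m[w] for wt, m in zip(weights, models))
            for w in models[0]}


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) for distributions over the same vocabulary."""
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def link_divergence(story_a, story_b, collection, query_len=5):
    tokens = [t for doc in collection for t in doc]
    background = {w: c / len(tokens) for w, c in Counter(tokens).items()}
    models = [doc_model(d, background) for d in collection]

    def query(story):  # step 1: most frequent in-vocabulary terms
        known = Counter(t for t in story if t in background)
        return [w for w, _ in known.most_common(query_len)]

    rm_a = relevance_model(query(story_a), models)  # step 2
    rm_b = relevance_model(query(story_b), models)
    return symmetric_kl(rm_a, rm_b)                 # step 3


collection = [["quake", "turkey", "rescue"], ["quake", "rescue", "aid"],
              ["election", "vote", "ballot"], ["vote", "ballot", "poll"]]
same_topic = link_divergence(["quake", "rescue"], ["quake", "aid"], collection)
diff_topic = link_divergence(["quake", "rescue"], ["vote", "poll"], collection)
```

A lower divergence suggests the two stories are linked; the link decision would compare it against a tuned threshold.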

Results: Story Link Detection

                     TDT-3    TDT-4
Cosine / tf.idf      .2551    .1983
Relevance Model      .1938    .1881
Rel. Model + ROI     .1862    .1863

Relevance Models for Tracking
1. Initialize:
   - set P(M|Q) = 1/Nt if M is a training doc
   - compute relevance model as before
2. For each incoming story D:
   - score = divergence between P(w|D) and RM
   - if (score > threshold): add D to the training, recompute RM
   - allow no more than k adaptations
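A minimal sketch of this tracking loop is given below. It assumes the score is the negative KL divergence from the relevance model to the story model, so that a higher score means a closer match and the "score > threshold" test reads naturally; the smoothing, the threshold value, and all names are likewise assumptions.

```python
import math
from collections import Counter


def smoothed_model(terms, background, lam=0.5):
    """Smoothed unigram model P(w|D) over the background vocabulary."""
    tf, n = Counter(terms), len(terms)
    return {w: lam * tf[w] / n + (1 - lam) * p for w, p in background.items()}


def relevance_model(training_docs, background):
    """Uniform mixture of training models, i.e. P(M|Q) = 1/Nt."""
    models = [smoothed_model(d, background) for d in training_docs]
    return {w: sum(m[w] for m in models) / len(models) for w in background}


def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def track(training_docs, stream, background, threshold, k):
    """Score each incoming story; adapt at most k times."""
    training = list(training_docs)
    rm = relevance_model(training, background)
    adaptations, scores = 0, []
    for story in stream:
        score = -kl(rm, smoothed_model(story, background))
        scores.append(score)
        if score > threshold and adaptations < k:
            training.append(story)
            rm = relevance_model(training, background)
            adaptations += 1
    return scores, len(training)


# Toy vocabulary with a uniform background model.
background = {w: 0.25 for w in ["quake", "rescue", "vote", "ballot"]}
scores, n_training = track(
    training_docs=[["quake", "rescue"]],
    stream=[["quake", "rescue", "quake"], ["vote", "ballot"]],
    background=background, threshold=-0.2, k=1)
```

In the toy run the first story is close enough to be added to the training set, and the off-topic second story scores well below the threshold.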

Results: Topic Tracking

                     TDT-3    TDT-4
Cosine / tf.idf      .1888    .1964
Language Model       .1481    .2122
Adaptive tf.idf      .1390    .2007
Relevance Model      .0953    .1784

Conclusions
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models
