Detection at Dragon Systems
Paul van Mulbregt, Sheera Knecht, Jon Yamron
Dragon Systems

Outline
- Data Sets
- Interpolated Models
- Targeting Against a Background
- English & Mandarin
- Word Stemming
- Effect of Automatic Boundaries
- 1999 System Compared to 2000 Systems
- Comments on the CDet Metric
- Conclusions

Data Sets
Two main data sets for experimentation:
- May/June English 1998 from TDT2, trained on January-April 1998.
- April/May/June English 1998 from TDT2 (the AMJ data set), trained on January-March 1998.
May/June has only 34 topics, while April/May/June has 70 topics (69 after removal of one topic all of whose documents are on multiple topics). The AMJ set has a smaller amount of training data but a larger number of topics, hopefully allowing more informed decisions to be made, so we have been using AMJ for almost all recent experiments.

Interpolated Models
For Tracking, interpolating the topic's unigram model with the background model had already been an improvement over backing off to the background model.
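A minimal sketch of the two smoothing schemes, to make the distinction concrete; the variable names and the interpolation weight are illustrative rather than the values used in the Dragon system, and the backoff version omits the discounting a real backoff model would need.

```python
from collections import Counter

def unigram_probs(tokens):
    """Maximum-likelihood unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolated_prob(w, topic_probs, background_probs, lam=0.5):
    """Interpolated model: always mix the topic and background estimates."""
    return lam * topic_probs.get(w, 0.0) + (1 - lam) * background_probs.get(w, 0.0)

def backoff_prob(w, topic_probs, background_probs):
    """Backoff model: trust the topic estimate when the word was seen in the
    topic data, and fall back to the background otherwise.
    (A real backoff model would also discount the topic counts so the
    distribution still normalizes; that is omitted here.)"""
    if w in topic_probs:
        return topic_probs[w]
    return background_probs.get(w, 0.0)
```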

Interpolated Models vs. Backoff Models
AMJ, English only, manual boundaries. Interpolated models appear to be a consistent win over backoff.

Targeting Input Against a Larger Background Corpus
The amount of data in a collection of TDT documents on a particular topic is not large: in Tracking, between 1 and 4 documents; in Detection, a cluster may contain as few as 1 document.
Idea: target the collection of documents against a much bigger (background) collection of documents. Augment the statistics of the small collection with the statistics of the big collection, and build a model from that.

Targeting (Tracking)
For Tracking, we actually do this. Take the seed documents (from TDT3) and target them against the background collection of documents in TDT2. Each document in TDT2 is assigned a weight, and these weights are then used to construct new counts for the seed collection:
Prob(w) = Sum over all documents d of weight(d) * Prob(w | d).
Linearly interpolate this distribution with the original distribution from the seed documents. (Interpolate again with the background to avoid the effect of zeros.)
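A rough sketch of this construction, assuming the per-document weights have already been produced by some relevance scoring; the weight normalization and the interpolation constant alpha are illustrative, not the tuned system values.

```python
from collections import Counter

def targeted_seed_distribution(seed_probs, background_docs, doc_weights, alpha=0.85):
    """Augment a seed (topic) distribution with a weighted mixture of
    background-document distributions,
        P_target(w) = sum_d weight(d) * P(w | d),
    then interpolate the result with the original seed distribution.

    seed_probs      : dict word -> probability estimated from the seed documents
    background_docs : list of token lists from the background corpus (e.g. TDT2)
    doc_weights     : one weight per background document (how these are computed
                      is not shown here)
    """
    total_w = sum(doc_weights) or 1.0
    target = {}
    for weight, doc in zip(doc_weights, background_docs):
        counts = Counter(doc)
        n = sum(counts.values())
        for word, c in counts.items():
            target[word] = target.get(word, 0.0) + (weight / total_w) * (c / n)

    vocab = set(seed_probs) | set(target)
    mixed = {w: alpha * seed_probs.get(w, 0.0) + (1 - alpha) * target.get(w, 0.0)
             for w in vocab}
    return mixed  # in practice, interpolate once more with the background model
```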

Targeting (Detection)
For Detection, this would involve a large amount of work: every time a cluster changes, target against the background and rebuild the statistics. Instead, target the incoming documents against the background just once, and interpolate the counts of the document with counts computed from the background corpus. Zeros don't matter, as this is done for the incoming documents. The hope is that this targeting will bring in words from background documents that didn't occur in the original document, making it easier to pick up documents which discuss the topic. Since these statistics are dumped into the clusters, it has the effect of smoothing the clusters.
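A small sketch of the incoming-document side, assuming the background-derived counts have already been computed by targeting the document once; the 15% mixing weight echoes the experiments on the next slide, and the exact combination scheme here is an assumption.

```python
from collections import Counter

def augment_incoming_document(doc_tokens, targeted_bg_counts, mix=0.15):
    """Blend an incoming document's counts with counts derived from targeting
    it against the background corpus, before the document is scored against
    (and possibly dumped into) the clusters."""
    doc_counts = Counter(doc_tokens)
    doc_total = sum(doc_counts.values())
    bg_total = sum(targeted_bg_counts.values()) or 1.0

    augmented = {}
    for w in set(doc_counts) | set(targeted_bg_counts):
        # Keep the augmented counts on the same scale as the original document.
        augmented[w] = ((1 - mix) * doc_counts.get(w, 0)
                        + mix * doc_total * targeted_bg_counts.get(w, 0) / bg_total)
    return augmented
```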

Targeting Data Mixed in with Actual Data Using Various Weightings
AMJ, English only, manual boundaries. Mixing didn't improve the best performance, but it flattens out the graph. Clearly 100% targeting is very sub-optimal, but a 15% mix was useful.

English vs. English & Mandarin
TDT 2000 dry run. Very noisy graph. There are few topics in Mandarin, so not much in the way of conclusions can be drawn.

Manual vs. Automatic Boundaries
AMJ, English only. About a 20% degradation from using automatic boundaries.

Stemming vs. No Stemming
AMJ, English only, manual boundaries. Stemming may make the graph a little less noisy, but...

Manual vs. Automatic, Stemmed vs. Unstemmed
AMJ, English, stemmed and unstemmed, manual and automatic boundaries. Stemming may help for manual boundaries, but appears a little worse for automatic boundaries.

1999 System vs. 2000 Systems
AMJ, English only. 1999 system: TFKLB backoff. 2000 systems: Dragon1 = TFKLB interpolated, mixed 15% with targeted background; Dragon2 = TFKLB interpolated. Dragon1 and Dragon2 are both better than the 1999 system. Dragon1 is flatter than Dragon2, but not necessarily better.

2000 Results: Dragon2 System, Manual Boundaries, Interpolated
We suffer a performance loss on English Detection by including Mandarin documents. Performance on newswire and broadcast news seems comparable. Reporting results on subsets doesn't make much sense for Detection, especially language-specific subsets. (For Tracking without adaptation, this should not be an issue.)

2000 Evaluation Numbers
Performance is better on Mandarin with automatic boundaries than with manual boundaries. I don't actually believe that this should be the case! Overall, there is about a 20% reduction in performance due to using automatic boundaries. [Figures from the slide's table: 23%, 21%, 19%, -10%, 56%, -25%.]

Why So Non-Continuous?
One cluster can split into two clusters, or can lose half its documents. A small change in the number of correct documents leads to a big change in the score. The de-emphasis on false alarms in the evaluation measure means the smaller, "purer" cluster is regarded as being 9 times worse than the other cluster.
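For reference, a sketch of the detection cost being discussed; the constants (CMiss = 1.0, CFA = 0.1, Ptopic = 0.02) are the values we recall from the TDT evaluation plan, so check the plan for the official numbers in any given year.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_topic=0.02, normalize=True):
    """TDT-style detection cost:
        CDet = CMiss * P(Miss) * P(topic) + CFA * P(FA) * (1 - P(topic))
    Optionally normalized by the cost of the better trivial system
    (always say YES, or always say NO)."""
    cost = c_miss * p_miss * p_topic + c_fa * p_fa * (1.0 - p_topic)
    if normalize:
        cost /= min(c_miss * p_topic, c_fa * (1.0 - p_topic))
    return cost
```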

YDZ Metric?
YDZ seems conceptually a reasonable way of measuring goodness of fit.

However...
In practice, however, it seems to have two problems:
- No minimum value. In fact, it appears to be linear across a wide range of numbers of clusters. Even for choices of CMiss where it is not linear, the metric is not particularly discriminatory (assuming, of course, that our system is producing outputs that do in fact differ). The sign of the linear coefficient depends on the size of CMiss.
- The same issue as with CDet: what is a realistic use of the technology, and how do we measure performance on that task?
Or one can just not spend time tuning for an evaluation, and instead concentrate on improving the algorithm and lowering the whole graph.

Miss vs. False Alarms on a DET Plot, with Level Curves of CDet
AMJ, English only, manual boundaries. The number of generated clusters varies from 16 (at the far right) to 4640 (at the far left), with the level-curve intersections corresponding to 633 and 2204 clusters.
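A sketch of how such level curves can be drawn; the cost constants are the usual TDT values (an assumption, as above), the cost levels are arbitrary, and the axes here are linear rather than the normal-deviate scales of a true DET plot.

```python
import numpy as np
import matplotlib.pyplot as plt

C_MISS, C_FA, P_TOPIC = 1.0, 0.1, 0.02          # usual TDT constants (assumption)
NORM = min(C_MISS * P_TOPIC, C_FA * (1.0 - P_TOPIC))

p_fa = np.linspace(0.0, 0.02, 200)
for level in (0.2, 0.5, 1.0, 2.0):              # illustrative normalized-cost levels
    # Solve level = (C_MISS*p_miss*P_TOPIC + C_FA*p_fa*(1-P_TOPIC)) / NORM for p_miss.
    p_miss = (level * NORM - C_FA * p_fa * (1.0 - P_TOPIC)) / (C_MISS * P_TOPIC)
    plt.plot(p_fa * 100, np.clip(p_miss, 0.0, 1.0) * 100, label=f"(CDet)norm = {level}")

plt.xlabel("False alarm probability (%)")
plt.ylabel("Miss probability (%)")
plt.legend()
plt.title("Level curves of the normalized detection cost (linear axes)")
plt.show()
```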

Why So Little Change in CDet?
When sweeping over a wide range of thresholds, with the number of clusters changing by a factor of more than 10, why is there so little change in the value of CDet? We find it hard to believe that 200 clusters are as useful as 2600 clusters. Is it our (Dragon's) distance measure? Is this phenomenon restricted to one site, or does it occur across all sites? The discontinuities lead us to wonder whether reported score differences are actually significant.

Overall Conclusions
- Interpolation is better than backoff as a smoothing method.
- Mixing in targeted data is one approach to bringing in outside information; it helped to smooth out performance but not to improve it.
- Stemming may also smooth out performance without providing any overall gain.
- Questions about the CDet metric still remain.
- Breaking out scores by subset does not make much sense for Detection.