Published by Claud Knight. Modified over 9 years ago.
1
New Event Detection at UMass Amherst
Giridhar Kumaran and James Allan
2
Preprocessing
  Lemur Toolkit for tokenization, stopping, k-stemming
    http://www-2.cs.cmu.edu/~lemur/
  BBN Identifinder™ for extracting named entities
CIIR, UMass Amherst
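The preprocessing steps can be sketched in a few lines. This is a minimal illustration and not the Lemur Toolkit pipeline: the stop list is a tiny made-up sample, the function name `preprocess` is hypothetical, and stemming is left as an optional callback because a k-stemmer (Krovetz) is not in the Python standard library. Named entity extraction is not shown.

```python
import re

# Tiny illustrative stop list (an assumption); the list used in the talk is not given.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for"}

def preprocess(text, stopwords=STOPWORDS, stem=None):
    """Lowercase, split into alphanumeric tokens, drop stopwords, optionally stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in stopwords]
    if stem is not None:
        # `stem` is any callable mapping a token to its stem (hypothetical hook).
        kept = [stem(t) for t in kept]
    return kept

print(preprocess("The crisis in Russia is a threat to the markets"))
```

Passing a stemmer as a callable keeps the sketch independent of any particular stemming library.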
3
Systems fielded
  Submitted four systems
  Didn’t include last year’s system
    Classification according to LDC categories and term pruning
    Didn’t work on exclusively NW (newswire) story corpus
4
Primary system – UMass1
  Utility of named entities acknowledged
  Failure analysis indicates
    Large number of old stories have low confidence scores (false alarms)
    Conflict with new story scores
  Reasons
    Stories on multiple topics
    Diffuse topics
    Varying document lengths
5
Primary system – UMass1
  Focus
    Identify old stories better – affects cost
  Clue
    Most old stories get low confidence scores as topics linked by
      only named entities (large number)
      only non-named entities (few)
6
Primary system – UMass1
  Approach
    Look at the set of closest matching stories
    If consistently high named entity or non-named entity match, modify confidence score
7
Primary system – UMass1
  Procedure
    Double original confidence score if less than a threshold
    Gradually reduce score towards original score if set of closest stories match neither named entities nor non-named entities
8
UMass1 – Examples from TDT3
Russian Financial Crisis - Old Story APW19981020.0237

Story              AllSim   NESim   noNESim
APW19981015.0139   0.278    0.273   0.270
APW19981009.0790   0.251    0.366   0.178
APW19981016.0669   0.237    0.423   0.166
APW19981006.0509   0.211    0.359   0.107
APW19981013.0582   0.206    0.395   0.056
APW19981006.0229   0.196    0.510   0.047
12
UMass1 – Examples from TDT3
Russian Financial Crisis - Old Story APW19981020.0237

Story              AllSim      NESim   noNESim
APW19981015.0139   0.278*1.6   0.273   0.270
APW19981009.0790   0.251       0.366   0.178
APW19981016.0669   0.237       0.423   0.166
APW19981006.0509   0.211       0.359   0.107
APW19981013.0582   0.206       0.395   0.056
APW19981006.0229   0.196       0.510   0.047

Threshold = 0.1
13
UMass1 – Examples from TDT3
Thai Airbus Crash - New Story APW19981211.0623

Story              AllSim      NESim   noNESim
APW19981022.0205   0.250*1.2   0.154   0.341
APW19981110.0229   0.184       0.052   0.282
APW19981113.0905   0.155       0.003   0.228
APW19981002.0557   0.152       0.234   0.012
APW19981114.0396   0.149       0.042   0.245
APW19981006.0511   0.143       0.031   0.251
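The two TDT3 examples can be replayed numerically. The sketch below copies the similarity values from the slides, counts how many of the six closest stories fall under the 0.1 threshold on each similarity, and applies the multiplier printed on the slide. The slides do not say how the multipliers 1.6 and 1.2 were computed, so both are passed in as given values. The function names, and the use of the same threshold for the Thai Airbus example, are assumptions made for illustration.

```python
THRESHOLD = 0.1  # shown on the Russian example; reusing it for both is an assumption

# (story_id, all_sim, ne_sim, non_ne_sim), copied from the example slides
RUSSIAN_CRISIS = [
    ("APW19981015.0139", 0.278, 0.273, 0.270),
    ("APW19981009.0790", 0.251, 0.366, 0.178),
    ("APW19981016.0669", 0.237, 0.423, 0.166),
    ("APW19981006.0509", 0.211, 0.359, 0.107),
    ("APW19981013.0582", 0.206, 0.395, 0.056),
    ("APW19981006.0229", 0.196, 0.510, 0.047),
]
THAI_CRASH = [
    ("APW19981022.0205", 0.250, 0.154, 0.341),
    ("APW19981110.0229", 0.184, 0.052, 0.282),
    ("APW19981113.0905", 0.155, 0.003, 0.228),
    ("APW19981002.0557", 0.152, 0.234, 0.012),
    ("APW19981114.0396", 0.149, 0.042, 0.245),
    ("APW19981006.0511", 0.143, 0.031, 0.251),
]

def weak_counts(matches, threshold=THRESHOLD):
    """Count matches below the threshold on NESim and on noNESim."""
    weak_ne = sum(1 for _, _, ne, _ in matches if ne < threshold)
    weak_non_ne = sum(1 for _, _, _, non_ne in matches if non_ne < threshold)
    return weak_ne, weak_non_ne

def boosted_top_score(matches, multiplier):
    """Apply a given multiplier to the AllSim of the closest story."""
    return round(matches[0][1] * multiplier, 4)

for name, matches, multiplier in [
    ("Russian Financial Crisis (old)", RUSSIAN_CRISIS, 1.6),
    ("Thai Airbus Crash (new)", THAI_CRASH, 1.2),
]:
    print(name, weak_counts(matches), boosted_top_score(matches, multiplier))
```

In the old-story example every named entity similarity stays above the threshold, which is the consistent-match pattern the approach looks for. In the new-story example four of the six named entity similarities fall below it.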
14
UMass1 on TDT3
15
UMass1 on TDT3
16
UMass2
  Basic vector space model system
  Compare with all preceding stories
  Return highest cosine match
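A minimal sketch of this kind of system follows. It is not the UMass2 implementation: the slides do not state the term weighting, so raw term counts are used here as a simplifying assumption, and the names `cosine` and `highest_matches` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dict-like)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def highest_matches(stream):
    """For each story in arrival order, return its best cosine against all earlier stories."""
    seen = []
    scores = []
    for tokens in stream:
        vec = Counter(tokens)  # raw term counts (assumed weighting)
        best = max((cosine(vec, old) for old in seen), default=0.0)
        scores.append(best)
        seen.append(vec)
    return scores

stories = [
    ["russia", "crisis", "ruble"],
    ["russia", "crisis", "ruble"],
    ["thai", "airbus", "crash"],
]
print(highest_matches(stories))
```

A low best match suggests a new event and a high one suggests an old story. Each story is compared with every earlier one, so the cost grows with the size of the collection.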
17
UMass3
  Same model as UMass2
  TDT5 – Very large collection
  Practical system
    Compare with a maximum of 25000 stories with highest coordination match
    Faster
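Candidate restriction by coordination match can be sketched as below, taking coordination match to mean the number of distinct terms a new story shares with an earlier one. The inverted index layout and the names `add_story` and `top_candidates` are assumptions made for illustration, not details from the talk. Only the returned candidates would then go through the full cosine comparison.

```python
import heapq
from collections import Counter, defaultdict

def add_story(story_id, tokens, index):
    """Record a story in the inverted index: term -> list of story ids."""
    for term in set(tokens):
        index[term].append(story_id)

def top_candidates(tokens, index, max_candidates=25000):
    """Return ids of earlier stories sharing the most distinct terms, best first."""
    overlap = Counter()
    for term in set(tokens):
        for story_id in index.get(term, ()):
            overlap[story_id] += 1
    top = heapq.nlargest(max_candidates, overlap.items(), key=lambda kv: kv[1])
    return [story_id for story_id, _ in top]

index = defaultdict(list)
add_story(0, ["russia", "crisis", "ruble"], index)
add_story(1, ["russia", "imf", "loan"], index)
add_story(2, ["thai", "airbus", "crash"], index)

print(top_candidates(["russia", "crisis", "default"], index, max_candidates=1))
```

Stories with no shared terms never enter the candidate set, so their cosine would have been zero anyway.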
18
UMass4
  Similar to UMass1
    Rationale is the same
  Consider top five matches
  Use different formula for modifying confidence score
19
Performance Summary

System                                                               Topic weighted min. cost (TDT5)   Topic weighted min. cost (TDT4)
UMass1 – Modify confidence score based on evidence                   0.8790                            0.5055
UMass2 – Basic vector space model                                    0.8387                            0.5404
UMass3 – UMass2 + restriction on number of documents compared with   0.8479                            0.5404
UMass4 – UMass1 with different formula                               0.9213                            --
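For readers unfamiliar with the cost figures, the sketch below computes a normalized detection cost from miss and false alarm rates. The slides do not give the formula. The form and the constants used here (miss cost 1, false alarm cost 0.1, target prior 0.02) are what I understand to be the usual TDT evaluation settings, so treat them as assumptions. The topic-weighted minimum reported in the table would additionally average over topics and take the best decision threshold, which is not shown.

```python
def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Detection cost divided by the cost of the better trivial system.

    Default constants are assumed TDT evaluation settings, not from the slides.
    """
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Trivial systems: answer "no" always (all misses) or "yes" always (all false alarms).
    trivial = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / trivial

print(normalized_cost(p_miss=0.5, p_fa=0.05))
```

Under these settings a value of 1.0 corresponds to a system no better than the trivial baseline, which puts the TDT5 figures in the table in context.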
20
Summary
  Basic vector space model did the best
  Restricting number of stories to be compared with
    Improved system speed
    Didn’t improve performance
  Primary system did extremely well on training data, but failed on TDT5