New Event Detection at UMass Amherst Giridhar Kumaran and James Allan.


1 New Event Detection at UMass Amherst
Giridhar Kumaran and James Allan

2 Preprocessing (CIIR, UMass Amherst)
- Lemur Toolkit for tokenization, stopping, k-stemming: http://www-2.cs.cmu.edu/~lemur/
- BBN Identifinder™ for extracting named entities

3 Systems fielded
- Submitted four systems
- Didn’t include last year’s system
  - Classification according to LDC categories and term pruning
  - Didn’t work on exclusively NW story corpus

4 Primary system – UMass1
- Utility of named entities acknowledged
- Failure analysis indicates
  - Large number of old stories have low confidence scores (false alarms)
  - Conflict with new story scores
  - Reasons
    - Stories on multiple topics
    - Diffuse topics
    - Varying document lengths

5 Primary system – UMass1
- Focus
  - Identify old stories better (affects cost)
- Clue
  - Most old stories get low confidence scores as topics linked by
    - only named entities (large number)
    - only non-named entities (few)

6 Primary system – UMass1
- Approach
  - Look at the set of closest matching stories
  - If named entity or non-named entity match is consistently high, modify the confidence score

7 Primary system – UMass1
- Procedure
  - Double the original confidence score if less than a threshold
  - Gradually reduce the score towards the original score if the set of closest stories matches neither named entities nor non-named entities

8 UMass1 – Examples from TDT3
- Russian Financial Crisis - Old Story APW19981020.0237

Story              AllSim   NESim   noNESim
APW19981015.0139   0.278    0.273   0.270
APW19981009.0790   0.251    0.366   0.178
APW19981016.0669   0.237    0.423   0.166
APW19981006.0509   0.211    0.359   0.107
APW19981013.0582   0.206    0.395   0.056
APW19981006.0229   0.196    0.510   0.047

10 UMass1 – Examples from TDT3
- Russian Financial Crisis - Old Story APW19981020.0237

Story              AllSim   NESim   noNESim
APW19981015.0139   0.278    0.273   0.270
APW19981009.0790   0.251    0.366   0.178
APW19981016.0669   0.237    0.423   0.166
APW19981006.0509   0.211    0.359   0.107
APW19981013.0582   0.206    0.395   0.056
APW19981006.0229   0.196    0.510   0.047

Threshold = 0.1

12 UMass1 – Examples from TDT3
- Russian Financial Crisis - Old Story APW19981020.0237

Story              AllSim      NESim   noNESim
APW19981015.0139   0.278*1.6   0.273   0.270
APW19981009.0790   0.251       0.366   0.178
APW19981016.0669   0.237       0.423   0.166
APW19981006.0509   0.211       0.359   0.107
APW19981013.0582   0.206       0.395   0.056
APW19981006.0229   0.196       0.510   0.047

Threshold = 0.1

13 UMass1 – Examples from TDT3
- Thai Airbus Crash - New Story APW19981211.0623

Story              AllSim      NESim   noNESim
APW19981022.0205   0.250*1.2   0.154   0.341
APW19981110.0229   0.184       0.052   0.282
APW19981113.0905   0.155       0.003   0.228
APW19981002.0557   0.152       0.234   0.012
APW19981114.0396   0.149       0.042   0.245
APW19981006.0511   0.143       0.031   0.251
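
A rough way to read the two example tables is to count how many of the six closest stories exceed the 0.1 threshold in each similarity column. The slides do not state how the threshold is applied or how the 1.6 and 1.2 multipliers are derived, so the per-neighbor comparison below is an assumption, and `count_above` is a hypothetical helper. The sketch only tallies the evidence and does not reproduce the UMass1 formula.

```python
THRESHOLD = 0.1  # value shown on the slides

# (AllSim, NESim, noNESim) for the six closest stories, copied from the slides.
russian_old_story = [
    (0.278, 0.273, 0.270),
    (0.251, 0.366, 0.178),
    (0.237, 0.423, 0.166),
    (0.211, 0.359, 0.107),
    (0.206, 0.395, 0.056),
    (0.196, 0.510, 0.047),
]
thai_new_story = [
    (0.250, 0.154, 0.341),
    (0.184, 0.052, 0.282),
    (0.155, 0.003, 0.228),
    (0.152, 0.234, 0.012),
    (0.149, 0.042, 0.245),
    (0.143, 0.031, 0.251),
]

NE_SIM, NO_NE_SIM = 1, 2  # column indices into each row


def count_above(rows, column, threshold=THRESHOLD):
    """Count neighbors whose similarity in `column` exceeds `threshold`.

    Hypothetical helper: applying the threshold per neighbor is an
    assumption, not something the slides specify.
    """
    return sum(1 for row in rows if row[column] > threshold)


if __name__ == "__main__":
    for name, rows in [("Russian (old)", russian_old_story),
                       ("Thai (new)", thai_new_story)]:
        print(name,
              "NESim above threshold:", count_above(rows, NE_SIM),
              "noNESim above threshold:", count_above(rows, NO_NE_SIM))
```

Under this reading, the old story's neighbors all match on named entities, while the new story's neighbors match mostly on non-named-entity terms.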

14 UMass1 on TDT3

15 UMass1 on TDT3

16 UMass2
- Basic vector space model system
- Compare with all preceding stories
- Return highest cosine match
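
The three steps of UMass2 can be sketched as follows. This is a minimal illustration, not the UMass implementation: the slides do not give the term weighting, so raw term counts are assumed here, and `cosine` and `highest_match_scores` are hypothetical names.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def highest_match_scores(stories):
    """For each story, in arrival order, return its highest cosine match
    against all preceding stories (0.0 for the first story).

    `stories` is a list of token lists. Raw term counts are an assumed
    weighting; the slides do not specify one.
    """
    seen = []
    scores = []
    for tokens in stories:
        vector = Counter(tokens)
        best = max((cosine(vector, old) for old in seen), default=0.0)
        scores.append(best)
        seen.append(vector)
    return scores
```

A story whose highest match is low resembles nothing seen so far, which makes it a candidate new event.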

17 UMass3
- Same model as UMass2
- TDT5: very large collection
- Practical system
  - Compare with a maximum of 25000 stories with highest coordination match
  - Faster
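
The candidate restriction in UMass3 can be sketched as a pre-filter. Coordination match is taken here to mean the number of distinct terms two stories share, which is the usual reading of the term. The function names and the tie-breaking are assumptions, not details from the slides.

```python
import heapq


def coordination_match(terms_a, terms_b):
    """Number of distinct terms shared by two stories."""
    return len(set(terms_a) & set(terms_b))


def top_candidates(story_terms, previous_stories, max_candidates=25000):
    """Return indices of the preceding stories with the highest
    coordination match, at most `max_candidates` of them.

    Ties are broken in favor of the later story, a side effect of
    comparing (score, index) tuples, not something the slides specify.
    """
    scored = (
        (coordination_match(story_terms, old), index)
        for index, old in enumerate(previous_stories)
    )
    return [index for _, index in heapq.nlargest(max_candidates, scored)]
```

Only the returned candidates would then go through the more expensive cosine comparison, which is where the speed-up comes from.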

18 UMass4
- Similar to UMass1
- Rationale is the same
- Consider top five matches
- Use different formula for modifying confidence score

19 Performance Summary

System                                                                Topic weighted min. cost (TDT5)   Topic weighted min. cost (TDT4)
UMass1 – Modify confidence score based on evidence                    0.8790                            0.5055
UMass2 – Basic vector space model                                     0.8387                            0.5404
UMass3 – UMass2 + restriction on number of documents compared with    0.8479                            0.5404
UMass4 – UMass1 with different formula                                0.9213                            --
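
For readers unfamiliar with the metric, the sketch below shows how a normalized detection cost is typically computed from miss and false-alarm rates, and how a minimum cost is taken over operating points. The default constants (`c_miss=1.0`, `c_fa=0.1`, `p_target=0.02`) are the values conventionally used in TDT evaluations; they are not stated in the slides and are an assumption here. The function expects rates that have already been averaged over topics.

```python
def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Detection cost normalized by the cost of the better trivial system
    (always answering yes, or always answering no).

    The default constants are assumed from TDT evaluation conventions,
    not taken from the slides.
    """
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    trivial = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / trivial


def min_cost(operating_points):
    """Lowest normalized cost over (p_miss, p_fa) pairs, one pair per
    decision threshold."""
    return min(normalized_detection_cost(pm, pf) for pm, pf in operating_points)
```

With these defaults the cost reduces to `p_miss + 4.9 * p_fa`, so values near 1.0 mean a system is doing little better than a trivial one.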

20 Summary
- Basic vector space model did the best
- Restricting number of stories to be compared with
  - Improved system speed
  - Didn’t improve performance
- Primary system did extremely well on training data, but failed on TDT5

