Preliminaries
CSCI-GA.2591, NYU
Ralph Grishman
Goal and Approach
- What are the limitations in extracting knowledge from text?
- Approach:
  - start with a skeleton MR system
  - enhance individual components
  - enhance ensemble
  - estimate confidence of components
  - estimate confidence of combined system
  - use domain model (Markov Logic Network)
  - learn as we go
Requirements
- provenance: addressed by using the Tipster architecture, UIMA
- speed: fast enough for rapid development
- scaling … develop algorithms that run in linear or n log n time and use a DB-based system (we won't)
- capture domain constraints: rule-based inference
- capture uncertainty: MLN
- enable joint inference: MLN
- domain adaptive: emphasize task-specific components
Schedule
1. preliminaries (Tipster, Jet-lite, ACE); sentence segmentation
2. NE
3. coreference
4. XD coreference; brief plan reports
5. relations
6. events
7. time
8. reports on components
9. joint inference: opportunities
10. probabilistic graphical models: beam search, belief propagation
11. Alchemy; domain models
12. KBP systems
13. self-learning
14. project reports
ACE
- Our system will be designed to read a document and extract the entities, relations, and events.
- To train and evaluate our system, we need a corpus annotated with this information.
- We will use the ACE 2005 corpus and the domain of national and international news (very broad).
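
To make the extraction target concrete, here is a minimal sketch of how entities and relations might be stored as stand-off annotations (character spans over unmodified text, in the spirit of the Tipster architecture). The class names `Span`, `Annotation`, and `Document`, the example sentence, and the feature names are all hypothetical illustrations, not the API of Jet or UIMA; the labels `PER`, `ORG`, and `ORG-AFF` are used only as example ACE-style types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the document text
    end: int    # exclusive character offset


@dataclass
class Annotation:
    kind: str                      # e.g. "entity", "relation", "event"
    span: Span                     # where in the text this annotation applies
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str                      # the text itself is never modified
    annotations: list = field(default_factory=list)

    def annotate(self, kind, start, end, **features):
        """Attach a new annotation to the span [start, end)."""
        if not (0 <= start <= end <= len(self.text)):
            raise ValueError("span outside document text")
        ann = Annotation(kind, Span(start, end), features)
        self.annotations.append(ann)
        return ann

    def text_of(self, ann):
        """Recover the source text an annotation points at (its provenance)."""
        return self.text[ann.span.start:ann.span.end]

    def of_kind(self, kind):
        return [a for a in self.annotations if a.kind == kind]


# Hypothetical example sentence and labels, for illustration only.
doc = Document("John Smith works for Acme Corp.")
person = doc.annotate("entity", 0, 10, subtype="PER")
org = doc.annotate("entity", 21, 30, subtype="ORG")
relation = doc.annotate("relation", 0, 30, subtype="ORG-AFF",
                        arg1=person, arg2=org)
```

Because every annotation carries its span, any extracted entity or relation can be traced back to the exact text that supports it, which is one way to read the provenance requirement.
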
Domains
- News domain is very broad, hard to model
- Difficult to see impact of domain model on language analysis
- Time permitting, may use a second, narrower model
  - football game reports
  - hurricane news
  - …
  - suggestions?
Corpora
- ACE 2005: 300 kw
  - defines classes of relations and events
  - widely used benchmark
  - 6 genres
  - being augmented by ERE annotation
- Penn Tree Bank: for sentences and POS
- Reuters: for NE annotation
- OntoNotes: for coreference and word sense training