University of Economics Prague
Information Extraction (WP6)
Martin Labský
MedIEQ meeting
Helsinki, 24th October 2006

Agenda
  Initial criteria set
  Additional criteria
  Information extraction toolkit
  Extraction engines
  IET demo
  Next steps

Initial criteria set – viewed as classes
  Resource (1,2,5,6,7,8)
    –title, URL, last update, language
    –MESH topic, target audience
  Author or Responsible (3,4)
    –name, address, phone
  6. MESH keywords
  9. virtual consultation
  10. advertisement
  11. seal
  2 extractable classes identified
  4 standalone attributes to be extracted

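The two classes and four standalone attributes listed on this slide could be represented roughly as follows. This is a minimal sketch, not the project's actual data model: the Python class names, field names, and the choice of optional string fields are assumptions made for illustration, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Resource description (criteria 1, 2, 5, 6, 7, 8 on the slide)."""
    title: Optional[str] = None
    url: Optional[str] = None
    last_update: Optional[str] = None
    language: Optional[str] = None
    mesh_topic: Optional[str] = None
    target_audience: Optional[str] = None


@dataclass
class Contact:
    """Author or Responsible (criteria 3, 4 on the slide)."""
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None


@dataclass
class StandaloneAttributes:
    """Attributes extracted on their own, outside any class."""
    mesh_keywords: List[str] = field(default_factory=list)
    virtual_consultation: List[str] = field(default_factory=list)
    advertisement: List[str] = field(default_factory=list)
    seal: List[str] = field(default_factory=list)


# Hypothetical values, for illustration only.
resource = Resource(title="Example health page", language="en")
author = Contact(name="Jane Doe")
```

Every field defaults to empty because an extractor may find only some of the attributes on a given page.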
Additional criteria – described
  Information sources
    –references to literature (citations)
    –identified as a whole (no segmentation into author, title etc.)
  Links to medical organisations
    –scientific orgs, self-help groups, related websites
    –name, contact info extracted as for Author/Responsible
  Sponsors
    –name, contact info extracted as for Author/Responsible
    –sponsor’s policy (free text) extracted in addition
  Content provider
    –name, contact info
    –provider’s profile (free text), typically from the ‘about’ page
  Privacy Policy
    –textual description of what may be done with collected data
  Accessibility
    –identify violations of certain Web Accessibility Initiative criteria

Putting the criteria together
  [Diagram of the combined data model; legend: initial criteria, additional criteria]
  Resource: title, URI, last update, language, MESH topic, target audience
  Contact: name, address, phone, www address
  Author, Responsible
  Sponsor: policy
  Content provider: profile
  Medical org.
  Other items: MESH keyword, virt. con. segment, advertisement, seal, information source, privacy policy, accessibility warning

Information extraction toolkit – architecture
  [Architecture diagram of the Information Extraction Toolkit; legend: user components, admin components]
  IE Engines: IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
  Other components: Pre-processor, Integrator, Data Model Manager, Task Manager UI, Visualiser, Annotation tool UI
  Data: labeled corpora (type B), documents with assigned n-best classes, labeling schemas, expert’s domain and extraction knowledge, annotated documents, extracted attributes and instances, repository of previously extracted items (WP5)
  Other labels in the diagram: UI, MUA, WP4, WP5, WP7

Information extraction toolkit – document flow
  [Document flow diagram]
  Components: Pre-processor, IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
  Input: classified document
  Steps shown:
    –select extraction model based on document class
    –extract attributes, add them to document
    –extract attributes, extract instances based on attributes, add them to document
  Output: extracted attributes and instances

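The flow on this slide (classified document in, model chosen by document class, engines adding attributes and instances to the document) might look like the sketch below. It is an illustration under stated assumptions, not the IET implementation: documents are modelled as plain dicts, the engines are toy functions, and all names (`preprocess`, `phone_attribute_engine`, `contact_instance_engine`, `MODELS`, `run_pipeline`) as well as the sample text are hypothetical.

```python
import re
from typing import Callable, Dict, List

# Assumption: a document is a dict, and an engine maps a document to an
# enriched copy of that document.
Document = Dict[str, object]
Engine = Callable[[Document], Document]


def preprocess(doc: Document) -> Document:
    """Toy pre-processor: normalise whitespace and add empty result lists."""
    out = dict(doc)
    out["text"] = " ".join(str(doc["text"]).split())
    out["attributes"] = list(doc.get("attributes", []))
    out["instances"] = list(doc.get("instances", []))
    return out


def phone_attribute_engine(doc: Document) -> Document:
    """Toy attribute extractor: tags digit runs that look like phone numbers."""
    out = dict(doc)
    found = [("phone", m.group())
             for m in re.finditer(r"\+?\d[\d ]{7,}\d", str(doc["text"]))]
    out["attributes"] = list(doc["attributes"]) + found
    return out


def contact_instance_engine(doc: Document) -> Document:
    """Toy instance extractor: groups the found attributes into one instance."""
    out = dict(doc)
    attrs = list(doc["attributes"])
    if attrs:
        instance = {"class": "Contact", **dict(attrs)}
        out["instances"] = list(doc["instances"]) + [instance]
    return out


# Assumption: an extraction model is simply the list of engines to run
# for a given document class.
MODELS: Dict[str, List[Engine]] = {
    "contact_page": [phone_attribute_engine, contact_instance_engine],
}


def run_pipeline(doc: Document, models: Dict[str, List[Engine]]) -> Document:
    doc = preprocess(doc)
    # Select the extraction model based on the document class.
    for engine in models.get(str(doc["doc_class"]), []):
        doc = engine(doc)
    return doc


result = run_pipeline(
    {"text": "Contact:  +000 111 222 333 for details", "doc_class": "contact_page"},
    MODELS,
)
```

Each engine returns a new document rather than mutating its input, which keeps the steps independent and easy to reorder.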
Extraction engines
  3rd party (NER): LingPipe, Annie, BiOs, JET...
    –extract attributes
    –state: tested by UNED
  ML extractor
    –extract attributes
    –state: developed at NCSR
  Statistical text extractor
    –needed to extract free text paragraphs of a certain kind, e.g. “about company text”, “privacy policy description”
    –state: future work; TKK will be the owner
  Ex (extraction ontology) extractor
    –extract attributes
    –extract instances based on identified attributes
    –state: developed at UEP

Demo
  Information Extraction Toolkit
    –extraction task management
      task = documents + ex.model + ex.engine
      definition, load, save, run, monitor progress
    –can use any IE engine which implements the Engine interface
    –showing preliminary UI (to be replaced by AQUA)
  Ex (extraction ontologies)
    –contact information sample

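The idea that a task bundles documents, an extraction model and an engine, and that any engine implementing a common interface can be plugged in, can be sketched as below. The slides do not show the real Engine interface, so the method name `extract`, its signature, the `Task` fields and the toy `KeywordEngine` are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


class Engine(ABC):
    """Hypothetical engine contract; the actual IET interface may differ."""

    @abstractmethod
    def extract(self, document: str, model: Dict[str, str]) -> List[Dict[str, str]]:
        """Return the attributes found in one document."""


class KeywordEngine(Engine):
    """Toy engine: the model maps attribute names to keywords to look for."""

    def extract(self, document: str, model: Dict[str, str]) -> List[Dict[str, str]]:
        return [{"attribute": name, "value": keyword}
                for name, keyword in model.items() if keyword in document]


@dataclass
class Task:
    """task = documents + extraction model + extraction engine."""
    documents: List[str]
    model: Dict[str, str]
    engine: Engine
    done: int = 0
    results: List[List[Dict[str, str]]] = field(default_factory=list)

    def run(self) -> None:
        for document in self.documents:
            self.results.append(self.engine.extract(document, self.model))
            self.done += 1

    def progress(self) -> float:
        """Fraction of documents processed so far."""
        return self.done / len(self.documents) if self.documents else 1.0


# Hypothetical task with two documents and a one-attribute model.
task = Task(
    documents=["this page carries ExampleSeal", "nothing relevant here"],
    model={"seal": "ExampleSeal"},
    engine=KeywordEngine(),
)
task.run()
```

Because `Task` depends only on the abstract `Engine`, swapping in another engine requires no change to the task management code.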
Next steps
  Integration of more extraction engines into IET
  Integration of IET into AQUA
  Improve
    –precision and recall
    –efficiency
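For reference, precision and recall of an extractor are usually computed against a manually labeled gold standard. The sketch below uses the standard definitions; exact matching of (attribute, value) pairs and the sample data are assumptions, and the slides do not say how the project scores its engines.

```python
from typing import Iterable, Tuple

Item = Tuple[str, str]  # (attribute name, extracted value)


def precision_recall(extracted: Iterable[Item], gold: Iterable[Item]) -> Tuple[float, float]:
    """Precision = correct / extracted, recall = correct / gold (exact match)."""
    extracted_set, gold_set = set(extracted), set(gold)
    correct = len(extracted_set & gold_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical extractor output and gold annotations.
extracted = [("name", "A"), ("phone", "1"), ("phone", "2")]
gold = [("name", "A"), ("phone", "1"), ("address", "X"), ("seal", "S")]
precision, recall = precision_recall(extracted, gold)
```

Here two of the three extracted items are correct and two of the four gold items are found.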