University of Economics Prague
Information Extraction (WP6)
Martin Labský
MedIEQ meeting, Helsinki, 24th October 2006

Agenda
• Initial criteria set
• Additional criteria
• Information extraction toolkit
• Extraction engines
• IET demo
• Next steps

Initial criteria set – viewed as classes
• Resource (1, 2, 5, 6, 7, 8) {
  – title, URL, last update, language,
  – MESH topic, target audience }
• Author or Responsible (3, 4) {
  – name, address, phone, e-mail }
• 6. MESH keywords
• 9. virtual consultation
• 10. advertisement
• 11. seal
2 extractable classes identified
4 standalone attributes to be extracted
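The two extractable classes and the four standalone attributes could be modelled roughly as follows. This is a minimal sketch only: the Python names (`Resource`, `AuthorOrResponsible`, `STANDALONE_ATTRIBUTES`, `ExtractionResult`) are illustrative assumptions, not the project's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Resource:
    """Extractable class covering criteria 1, 2, 5, 6, 7, 8 (hypothetical layout)."""
    title: Optional[str] = None
    url: Optional[str] = None
    last_update: Optional[str] = None
    language: Optional[str] = None
    mesh_topic: Optional[str] = None
    target_audience: Optional[str] = None


@dataclass
class AuthorOrResponsible:
    """Extractable class covering criteria 3 and 4 (hypothetical layout)."""
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None


# The four standalone attributes (criteria 6, 9, 10, 11), kept as plain labels.
STANDALONE_ATTRIBUTES = ["mesh_keyword", "virtual_consultation", "advertisement", "seal"]


@dataclass
class ExtractionResult:
    """Everything extracted from one document: class instances plus standalone values."""
    resource: Resource = field(default_factory=Resource)
    contacts: List[AuthorOrResponsible] = field(default_factory=list)
    standalone: Dict[str, List[str]] = field(default_factory=dict)
```

Keeping every field optional reflects that an extractor may find only some attributes of an instance on a given page.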

Additional criteria – described
• Information sources
  – references to literature (citations)
  – identified as a whole (no author, title etc. segmentation)
• Links to medical organisations
  – scientific orgs, self-help groups, related websites
  – name, contact info extracted as for Author/Responsible
• Sponsors
  – name, contact info extracted as for Author/Responsible
  – sponsor’s policy (free text) extracted in addition
• Content provider
  – name, contact info
  – provider’s profile (free text) typically from ‘about’ page
• Privacy Policy
  – textual description of what may be done with collected data
• Accessibility
  – identify violation of certain Web Accessibility Initiative criteria

Putting the criteria together
[Class diagram; labels recoverable from the slide:]
• Resource: title, URI, last update, language, MESH topic, target audience
• Contact: name, address, phone, e-mail, www address
• Author, Responsible, Medical org., Sponsor (policy), Content provider (profile)
• Standalone: MESH keyword, virt. con. segment, advertisement, seal, information source, privacy policy, accessibility warning
• Legend: initial criteria, additional criteria

Information extraction toolkit – architecture
[Architecture diagram; labels recoverable from the slide:]
• IE Engines: IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
• Other toolkit components: Pre-processor, Integrator, Data Model Manager, Task Manager, Visualiser, Annotation tool (some labelled "UI")
• Data items: Labeled corpora (type B), Documents with assigned n-best classes, Labeling schemas, Expert’s domain and extraction knowledge, Repository of previously extracted items, Annotated documents, Extracted attributes, instances
• External labels: WP4, WP5, WP7, MUA
• Legend: user components, admin components

Information extraction toolkit – document flow
[Flow diagram; labels recoverable from the slide:]
• Components: Pre-processor, IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
• Input: classified document
• Steps:
  – select extraction model based on document class
  – extract attributes, add them to document
  – extract attributes, extract instances based on attributes, add them to document
• Output: extracted attributes and instances
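One possible reading of this flow, sketched in Python under stated assumptions: a model is chosen by document class, attribute engines run first and add their results to the document, and a final engine builds instances from those attributes. The dictionary layout, the toy `email_engine` and `group_instances` functions, and `run_flow` are all hypothetical stand-ins, not the toolkit's real engines or API.

```python
import re
from typing import Callable, Dict, List, Tuple

Attribute = Tuple[str, str]  # (attribute name, extracted text)


def email_engine(doc: dict, model: dict) -> List[Attribute]:
    """Toy attribute extractor standing in for an NER or ML engine."""
    if "email" not in model["attributes"]:
        return []
    pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
    return [("email", match) for match in re.findall(pattern, doc["text"])]


def group_instances(doc: dict, model: dict) -> List[dict]:
    """Toy instance builder: bundles the attributes found so far into one instance."""
    if not doc["attributes"]:
        return []
    instance = {"class": model["instance_class"]}
    for name, value in doc["attributes"]:
        instance.setdefault(name, value)  # keep the first value per attribute
    return [instance]


def run_flow(
    doc: dict,
    models: Dict[str, dict],
    attribute_engines: List[Callable[[dict, dict], List[Attribute]]],
    instance_engine: Callable[[dict, dict], List[dict]],
) -> dict:
    # Select the extraction model based on the document class.
    model = models[doc["doc_class"]]
    # Attribute engines extract attributes and add them to the document.
    for engine in attribute_engines:
        doc["attributes"].extend(engine(doc, model))
    # The last engine extracts instances based on the collected attributes.
    doc["instances"].extend(instance_engine(doc, model))
    return doc
```

A call might look like `run_flow(doc, models, [email_engine], group_instances)`, where `doc` carries `text`, `doc_class`, and initially empty `attributes` and `instances` lists.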

Extraction engines
• 3rd party (NER): LingPipe, Annie, BiOs, JET...
  – extract attributes
  – state: tested by UNED
• ML extractor
  – extract attributes
  – state: developed at NCSR
• Statistical text extractor
  – needed to extract free text paragraphs of certain kind, e.g. “about company text”, “privacy policy description”
  – state: future work; TKK will be the owner
• Ex (extraction ontology) extractor
  – extract attributes
  – extract instances based on identified attributes
  – state: developed at UEP

Demo
• Information Extraction Toolkit
  – extraction task management
    task = documents + ex.model + ex.engine
    definition, load, save, run, monitor progress
  – can use any IE engine which implements the Engine interface
  – showing preliminary UI (to be replaced by AQUA)
• Ex (extraction ontologies)
  – contact information sample
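The idea of a task as documents plus an extraction model plus an engine, with any engine pluggable behind a common interface, might be sketched as below. The slide names an Engine interface but does not show it, so the `extract` method, the `Task` fields, the progress callback, and the toy `KeywordEngine` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class Engine(ABC):
    """Hypothetical stand-in for the Engine interface mentioned on the slide."""

    @abstractmethod
    def extract(self, document: str, model: dict) -> List[dict]:
        """Return the items extracted from one document."""


class KeywordEngine(Engine):
    """Toy engine: reports which model keywords occur in the document."""

    def extract(self, document: str, model: dict) -> List[dict]:
        text = document.lower()
        return [
            {"attribute": "keyword", "value": kw}
            for kw in model.get("keywords", [])
            if kw.lower() in text
        ]


@dataclass
class Task:
    """task = documents + extraction model + extraction engine"""
    documents: List[str]
    model: dict
    engine: Engine
    results: List[List[dict]] = field(default_factory=list)

    def run(self, on_progress: Optional[Callable[[int, int], None]] = None) -> List[List[dict]]:
        self.results = []
        total = len(self.documents)
        for done, document in enumerate(self.documents, start=1):
            self.results.append(self.engine.extract(document, self.model))
            if on_progress is not None:
                on_progress(done, total)  # lets a UI monitor progress
        return self.results
```

Because `Task` depends only on the abstract `Engine`, swapping in a different extractor needs no change to the task management code.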

Next steps
• Integration of more extraction engines into IET
• Integration of IET into AQUA
• Improve
  – precision and recall
  – efficiency
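For reference, precision and recall over extracted attributes can be computed as in the sketch below. Representing each extracted item as an (attribute, value) pair and requiring an exact match against a gold annotation are simplifying assumptions; the slides do not say how the project scores matches.

```python
from typing import Set, Tuple

Item = Tuple[str, str]  # (attribute name, extracted value)


def precision_recall(extracted: Set[Item], gold: Set[Item]) -> Tuple[float, float]:
    """Exact-match precision and recall of extracted items against gold items."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Precision is the share of extracted items that are correct; recall is the share of gold items that were found.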
