University of Economics Prague Information Extraction (WP6) Martin Labský MedIEQ meeting Helsinki, 24th October 2006.

1 University of Economics Prague
Information Extraction (WP6)
Martin Labský
MedIEQ meeting, Helsinki, 24th October 2006

2 Agenda
• Initial criteria set
• Additional criteria
• Information extraction toolkit
• Extraction engines
• IET demo
• Next steps

3 Initial criteria set – viewed as classes
• Resource (1,2,5,6,7,8) { title, URL, last update, language, MESH topic, target audience }
• Author or Responsible (3,4) { name, address, phone, e-mail }
• 6. MESH keywords
• 9. virtual consultation
• 10. advertisement
• 11. seal
2 extractable classes identified
4 standalone attributes to be extracted
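The two classes and four standalone attributes on this slide could be modelled roughly as follows. This is only a sketch: the Python names (`Resource`, `AuthorOrResponsible`, `ExtractedCriteria`) and the choice of plain optional strings are assumptions for illustration, not the toolkit's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    # Attributes listed for the Resource class (criteria 1, 2, 5, 6, 7, 8).
    title: Optional[str] = None
    url: Optional[str] = None
    last_update: Optional[str] = None
    language: Optional[str] = None
    mesh_topic: Optional[str] = None
    target_audience: Optional[str] = None


@dataclass
class AuthorOrResponsible:
    # Attributes listed for the Author or Responsible class (criteria 3, 4).
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None


@dataclass
class ExtractedCriteria:
    # Hypothetical container: class instances plus the standalone attributes.
    resource: Resource = field(default_factory=Resource)
    contacts: List[AuthorOrResponsible] = field(default_factory=list)
    mesh_keywords: List[str] = field(default_factory=list)
    virtual_consultation: List[str] = field(default_factory=list)
    advertisement: List[str] = field(default_factory=list)
    seal: List[str] = field(default_factory=list)


result = ExtractedCriteria()
result.resource.title = "Example health page"
result.contacts.append(AuthorOrResponsible(name="Jane Doe", email="jane@example.org"))
result.mesh_keywords.append("Diabetes Mellitus")
```

Standalone attributes are kept as lists here because a page may contain several matching segments; that choice is also an assumption.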

4 Additional criteria – described
• Information sources
  – references to literature (citations)
  – identified as a whole (no author, title etc. segmentation)
• Links to medical organisations
  – scientific orgs, self-help groups, related websites
  – name, contact info extracted as for Author/Responsible
• Sponsors
  – name, contact info extracted as for Author/Responsible
  – sponsor’s policy (free text) extracted in addition
• Content provider
  – name, contact info
  – provider’s profile (free text) typically from ‘about’ page
• Privacy Policy
  – textual description of what may be done with collected data
• Accessibility
  – identify violation of certain Web Accessibility Initiative criteria

5 Putting the criteria together
[Diagram combining the initial criteria and the additional criteria. Labels:]
• Resource: title, URI, last update, language, MESH topic, target audience
• Contact: name, address, phone, e-mail, www address
• Author, Responsible, Sponsor (policy), Content provider (profile), Medical org.
• Other labels: MESH keyword, virt. con. segment, advertisement, seal, information source, privacy policy, accessibility warning

6 Information extraction toolkit – architecture
[Architecture diagram of the INFORMATION EXTRACTION TOOLKIT. Labels:]
• IE Engines: IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
• Components: Pre-processor, Integrator, Data Model Manager, Task Manager, Visualiser, Annotation tool, UI
• Data: Documents with assigned n-best classes, Labeled corpora (type B), Labeling schemas, Expert’s domain and extraction knowledge, Annotated documents, Extracted attributes and instances, Repository of previously extracted items
• Work packages and other labels: WP4, WP5, WP7, MUA
• Legend: user components, admin components

7 Information extraction toolkit – document flow
[Flow diagram. Labels:]
• Components: Pre-processor, IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
• Input: classified document
• Steps: select extraction model based on document class; extract attributes, add them to document; extract attributes, extract instances based on attributes, add them to document
• Output: extracted attributes and instances
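The document flow on this slide can be sketched as a small pipeline: pick an extraction model from the document class, let one engine add attributes to the document, then let another build instances from those attributes. Everything below (the `Document` type, the `MODELS` table, the regular expressions and both engine functions) is a hypothetical stand-in, not the IET implementation.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Model = Dict[str, str]  # assumed shape: attribute name -> regular expression


@dataclass
class Document:
    text: str
    doc_class: str
    attributes: List[Tuple[str, str]] = field(default_factory=list)
    instances: List[Dict[str, str]] = field(default_factory=list)


# Hypothetical extraction models, keyed by document class.
MODELS: Dict[str, Model] = {
    "contact_page": {
        "e-mail": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
        "phone": r"\+?\d[\d ]{7,}\d",
    },
}


def extract_attributes(doc: Document, model: Model) -> None:
    # Attribute-level engine: find values and add them to the document.
    for name, pattern in model.items():
        for match in re.finditer(pattern, doc.text):
            doc.attributes.append((name, match.group(0)))


def extract_instances(doc: Document, model: Model) -> None:
    # Instance-level engine: group already extracted attributes into one
    # instance, keeping the first value seen for each attribute name.
    instance: Dict[str, str] = {}
    for name, value in doc.attributes:
        instance.setdefault(name, value)
    if instance:
        doc.instances.append(instance)


def run_flow(doc: Document, engines: List[Callable[[Document, Model], None]]) -> Document:
    model = MODELS[doc.doc_class]  # select extraction model based on document class
    for engine in engines:
        engine(doc, model)
    return doc


doc = run_flow(
    Document("Contact: jane@example.org, phone +420 123 456 789", "contact_page"),
    [extract_attributes, extract_instances],
)
```

Running the instance-level step after the attribute-level step mirrors the slide's "extract instances based on attributes"; a real engine would of course use far richer evidence than two regular expressions.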

8 Extraction engines
• 3rd party (NER): LingPipe, Annie, BiOs, JET...
  – extract attributes
  – state: tested by UNED
• ML extractor
  – extract attributes
  – state: developed at NCSR
• Statistical text extractor
  – needed to extract free text paragraphs of certain kind, e.g. “about company text”, “privacy policy description”
  – state: future work; TKK will be the owner
• Ex (extraction ontology) extractor
  – extract attributes
  – extract instances based on identified attributes
  – state: developed at UEP

9 Demo
• Information Extraction Toolkit
  – extraction task management
    task = documents + ex.model + ex.engine
    definition, load, save, run, monitor progress
  – can use any IE engine which implements the Engine interface
  – showing preliminary UI (to be replaced by AQUA)
• Ex (extraction ontologies)
  – contact information sample
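A task defined as documents + extraction model + engine, with load, save, run and progress monitoring, might look like the sketch below. The slides name an Engine interface but do not show it, so the `extract` method, the `ExtractionTask` class, the JSON format and the toy `MarkerEngine` are all assumptions made for illustration.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List, Optional

Model = Dict[str, str]  # assumed shape: attribute name -> marker substring


class Engine(ABC):
    """Hypothetical minimal interface that any pluggable IE engine implements."""

    @abstractmethod
    def extract(self, document: str, model: Model) -> List[Dict[str, str]]:
        ...


class MarkerEngine(Engine):
    """Toy engine: reports every whitespace token containing a model marker."""

    def extract(self, document: str, model: Model) -> List[Dict[str, str]]:
        found = []
        for token in document.split():
            for attribute, marker in model.items():
                if marker in token:
                    found.append({"attribute": attribute, "value": token})
        return found


@dataclass
class ExtractionTask:
    documents: List[str]
    model: Model
    engine_name: str

    def save(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def load(cls, data: str) -> "ExtractionTask":
        return cls(**json.loads(data))

    def run(
        self,
        engines: Dict[str, Engine],
        on_progress: Optional[Callable[[int, int], None]] = None,
    ) -> List[List[Dict[str, str]]]:
        engine = engines[self.engine_name]
        results = []
        for done, document in enumerate(self.documents, start=1):
            results.append(engine.extract(document, self.model))
            if on_progress is not None:
                on_progress(done, len(self.documents))  # monitor progress
        return results


task = ExtractionTask(
    documents=["Write to jane@example.org today", "No contact here"],
    model={"e-mail": "@"},
    engine_name="marker",
)
restored = ExtractionTask.load(task.save())
progress: List[str] = []
results = restored.run(
    {"marker": MarkerEngine()},
    on_progress=lambda done, total: progress.append(f"{done}/{total}"),
)
```

Referring to the engine by name rather than by object keeps the saved task serialisable and lets the same task definition run against any engine registered under that name.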

10 Next steps
• Integration of more extraction engines into IET
• Integration of IET into AQUA
• Improve
  – precision and recall
  – efficiency
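For reference, precision and recall of an extraction run are commonly computed by comparing extracted items against a hand-labelled gold set. The sketch below assumes exact matching of (attribute, value) pairs, which is only one possible scoring scheme; the slides do not say how the project evaluates its engines.

```python
from typing import Set, Tuple

Item = Tuple[str, str]  # (attribute name, extracted value)


def precision_recall(extracted: Set[Item], gold: Set[Item]) -> Tuple[float, float]:
    # True positives are items that appear in both sets (exact match).
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


extracted = {("e-mail", "jane@example.org"), ("phone", "123 456")}
gold = {("e-mail", "jane@example.org"), ("name", "Jane Doe")}
precision, recall = precision_recall(extracted, gold)
```

Here one of the two extracted items is correct and one of the two gold items is found, so both values come out at 0.5.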
