Introduction to MedIEQ: Quality Labelling of Medical Web content using Multilingual Information Extraction (WP6 – Information Extraction, 1.2.2006)

Presentation transcript:

WP6 – Information Extraction
Introduction to MedIEQ
Quality Labelling of Medical Web content using Multilingual Information Extraction
Martin Labský
Knowledge Engineering Group (KEG), University of Economics Prague (UEP)

Purpose of MedIEQ
Medical web sites are increasingly popular
Content strongly affects users’ decisions
Therefore, quality labeling is very important
Agencies invest considerable effort in labeling websites manually
We develop tools to minimize their effort
Tools will be multilingual and will support different and evolving labeling criteria

Agenda
Partners
Description of relevant work packages [3]
–Web content collection, Information Extraction, Lexical and semantic resources
–Goals, tasks, partners
–Existing tools (to be extended)
–New tools (to be developed)
–Existing resources (to be made accessible)
Milestones & deliverables
References
Questions

Partners
Agencies
–WMA: Web Médica Acreditada (Es)
  assigns a quality label that is shown on medical websites
  websites apply for the label, receive suggested changes, then obtain it
–AQuMed: Agency for Quality Labeling in Medicine (De)
  maintains a web directory organized by topics
  only good-quality websites are present
Developers
–NCSR Demokritos and I-Sieve (spin-off) (Gr)
–UEP: University of Economics Prague (Cz)
–UNED: National University of Distance Education (Es)
–HUT: Helsinki University of Technology (Fi)

Web Content Collection (WP5)

Website monitoring
Regular visits to labeled website
Checking pages
–for relevant changes
–which changes are relevant? manual rules, machine learning...
–alert agency when significant changes occur
–or, increase the website’s (web page’s) priority in a list of to-be-checked resources
–show what has changed, suggest solution
Needed by WMA, AQuMed
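The "manual rules" option for deciding which changes are relevant could be sketched roughly as follows. This is only an illustration, not the MedIEQ implementation: the keyword list, the function names and the sample page snapshots are all assumptions made up for the example, and the diff is computed with Python's standard difflib module.

```python
import difflib

# Hypothetical manual rule: a changed line is "relevant" if it mentions one of these.
RELEVANT_KEYWORDS = ("sponsor", "contact", "privacy")


def changed_lines(old_text, new_text):
    """Return added ('+ ') and removed ('- ') lines between two page snapshots."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def relevant_changes(old_text, new_text, keywords=RELEVANT_KEYWORDS):
    """Keep only the changed lines that match the keyword rule."""
    return [
        line
        for line in changed_lines(old_text, new_text)
        if any(keyword in line.lower() for keyword in keywords)
    ]


# Made-up snapshots of the same page on two visits.
old = "Contact: Dr. A\nSponsor: Acme\nOpening hours: 9-17"
new = "Contact: Dr. A\nSponsor: Globex\nOpening hours: 9-18"

for line in relevant_changes(old, new):
    print(line)
```

Here the change of sponsor would be reported to the agency, while the change of opening hours would be ignored. A learned classifier could replace the keyword rule without changing the rest of the sketch.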

Web focused crawling
Find new medical websites
Use multiple existing search engines
–specify lists of keywords / keyphrases
–give sample “similar” documents
–use Google/Yahoo API and filter their results
NCSR already has a focused crawler
–we should contribute to its development
Needed by WMA

Website spidering
Walk pages of a single website
Classify each page
–in order to choose relevant docs for quality labeling
–e.g. contact page, page containing treatment description, page with sponsors
–use machine learning, e.g. based on a bag-of-words (unigram, bigram) document representation
Spidering strategy
–which documents belong together (e.g. page 1/7)
–which links to follow next
NCSR has a spider
–uses classifiers from Weka for doc classification
–we should contribute
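To make the bag-of-words idea concrete, here is a minimal sketch of page classification over unigram and bigram features. It swaps in a tiny hand-written multinomial Naive Bayes classifier instead of the Weka classifiers mentioned above; the class names, training texts and labels are invented for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def features(text):
    """Bag-of-words features: lowercase unigrams plus adjacent-word bigrams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()                 # label -> number of training docs
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocab = set()

    def fit(self, docs):
        for text, label in docs:
            self.class_docs[label] += 1
            for feature in features(text):
                self.feature_counts[label][feature] += 1
                self.vocab.add(feature)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feature in features(text):
                if feature in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[feature] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set: one page text per page type.
training = [
    ("Contact us by phone or email at our office address", "contact"),
    ("Our sponsors and funding partners support this site", "sponsors"),
    ("Treatment options include medication and surgery for the disease", "treatment"),
]
classifier = NaiveBayes().fit(training)
print(classifier.predict("phone email office"))
print(classifier.predict("funding from sponsors"))
```

With realistic amounts of labeled pages per class, the same feature function could feed any standard classifier; the point of the sketch is only the document representation.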

Information Extraction (WP6)

IE introduction
Documents to extract from
–pages retrieved & classified by spider
  from known websites
  from crawler
–monitored labeled pages that have changed
Information to be extracted
–derived from agencies’ labeling criteria
–e.g. contact information of responsible persons, sponsor names, privacy warning texts...
Questions
–how much human intervention needed?
–complexity of label sets to be supported?
–methodology of porting to a new language?

Example extracted information I.
Transparency and honesty
–site provider (company name, contact)
–site purpose, type of target audience
–funding (grants, sponsors)
Authority
–source citation for information provided, its type and date
–names and credentials of all information providers
Privacy and data protection
–privacy policy description
Timeliness of information
–dates of publication/modification
Accountability
–names (and roles) of people responsible for presented information
–editorial policy description

Example extracted information II.
Content
–medical terms, e.g. disease and drug names
–statements recommending a certain product/method
–advertisements
–disallowed combinations (e.g. advertisement for X adjacent to an article related to X)
Formal
–mandatory statements (e.g. importance of physical examination, privacy warnings when posting data into chats)

Sources of extraction knowledge
Training data
–scarcity will be a problem for most extracted attributes
–different types: labeled documents, sample extracted data, data previously extracted from the same website, domain dictionaries
Extraction patterns
–induced (semi)automatically from scarce training data
–or even authored manually
Background domain knowledge
–relations between extracted attributes, cardinalities...
–e.g. typically just one company is the web site’s provider, but there are often multiple sponsors
Web site structure
–exploit common formatting of a group of documents within a website
–exploit common formatting used for a particular type of extracted data across different websites
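The cardinality example above (one provider, possibly many sponsors) can be expressed as a simple check over extracted candidates. The sketch below is an assumption about how such background knowledge might be encoded; the rule table, attribute names and function name are hypothetical and do not come from any MedIEQ tool.

```python
# Hypothetical cardinality rules: attribute -> (minimum, maximum); None means unbounded.
CARDINALITY = {
    "provider": (1, 1),     # typically exactly one company provides the site
    "sponsor": (0, None),   # any number of sponsors is acceptable
}


def cardinality_violations(extracted, rules=CARDINALITY):
    """Return one message per attribute whose number of extracted values breaks a rule."""
    problems = []
    for attribute, (minimum, maximum) in rules.items():
        count = len(extracted.get(attribute, []))
        if count < minimum:
            problems.append(f"{attribute}: expected at least {minimum}, found {count}")
        elif maximum is not None and count > maximum:
            problems.append(f"{attribute}: expected at most {maximum}, found {count}")
    return problems


# Made-up extraction result with two competing provider candidates.
candidates = {"provider": ["Acme Health", "Globex Media"], "sponsor": ["A", "B"]}
print(cardinality_violations(candidates))
```

A violation does not have to reject the page; it can simply signal that the extractor should keep only its most confident provider candidate or ask a human reviewer.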

IE tools
Ex (UEP)
–IE system under development using “extraction ontologies”
–extracts instances from semi-structured documents
–utilizes training data + manually defined patterns, includes spider
–old version based on HMMs
–
Named entity recognizer (UNED)
–extracts dates, person/institution names
3rd-party IE tools
–wrapper management systems
–e.g. (LP)²-based IE tool or annotation editor from Sheffield
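As a rough illustration of manually authored extraction patterns for one attribute type (dates of publication or modification), the sketch below uses two regular expressions. It is not the UNED recognizer or the Ex system; the supported formats (English "24 October 2006" and ISO "2006-10-24"), the month table and the function name are assumptions chosen for the example.

```python
import re
from datetime import date

# English month names only; other languages would need their own tables (assumption).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Alternative 1: "24 October 2006"; alternative 2: "2006-10-24".
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b|\b(\d{4})-(\d{2})-(\d{2})\b"
)


def extract_dates(text):
    """Return all dates found in the text, in order of appearance."""
    found = []
    for match in DATE_RE.finditer(text):
        try:
            if match.group(4):  # ISO form matched
                found.append(date(int(match.group(4)), int(match.group(5)), int(match.group(6))))
            else:               # day, month name, year form matched
                month = MONTHS.get(match.group(2).lower())
                if month is not None:
                    found.append(date(int(match.group(3)), month, int(match.group(1))))
        except ValueError:
            continue  # skip impossible dates such as 2006-13-45
    return found


print(extract_dates("Last modified 24 October 2006; published 2005-03-01."))
```

Hand-written patterns like these are cheap for well-formed attributes such as dates, which is one reason they are attractive when training data is scarce.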

Website assessment
Check website’s technical correctness
–SEO (findability in search engines with respect to some keyphrases)
–accessibility (possibility of font enlargement, blind access, pages hidden deep in website structure, color schemes perceivable by anybody)
–formal correctness (dead links, violations of HTML standards, failure to display well under at least the 3 most popular browsers)
Check non-technical correctness
–e.g. typos, “clear, easy-to-understand language”
–more: check for black-listed phrases, claims, etc.
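Two of the technical checks listed above (dead links and one aspect of blind access) can be sketched with Python's standard html.parser module. This is a simplified illustration, not the Relaxed validator or any MedIEQ component: it only looks at site-internal links against a set of paths the spider is assumed to have already fetched, and it treats a missing alt attribute on an image as an accessibility problem. The class name, function name and sample HTML are invented.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collect link targets and count images that lack alternative text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and not attributes.get("alt"):
            self.images_missing_alt += 1


def audit(html, known_paths):
    """Report internal links that point outside the known paths, plus alt-less images."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    dead = [href for href in parser.links
            if href.startswith("/") and href not in known_paths]
    return {"dead_internal_links": dead,
            "images_missing_alt": parser.images_missing_alt}


# Made-up page and the set of paths assumed to exist on the site.
page = ('<a href="/contact">Contact</a><a href="/gone">Old page</a>'
        '<img src="x.png"><img src="logo.png" alt="Logo">')
report = audit(page, known_paths={"/contact"})
print(report)
```

Checking external links would additionally require HTTP requests, which are left out here to keep the sketch self-contained.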

Website assessment tools
Relaxed (UEP)
–HTML validator based on Relax NG and Schematron patterns
–can perform formal checks of website content beyond DTDs
–
SEO tool (UEP)
–could Honza’s SEO tool be extended?

IE Deliverables
Duration: M1-M28
Deliverables
–D8: Methodology & architecture of IE (M9)
–D9.1: First version of IE toolkit (M15)
–D9.2: Final version of IE toolkit (M24)

Lexical and semantic resources (WP7)

Lexical and semantic resources
Sp, De, En, Cz, Gr, Fi, Catalan (7!)
We are in charge of Cz, De(!)
Semantic
–thesauri, ontologies (MeSH)
–lists of cures, vaccine names, lists of medical companies, illnesses, diagnoses
–generic ontologies and translation dictionaries (e.g. EuroWordNet)
Lexical
–lemmatizers/morphology analyzers, part-of-speech taggers, chunkers, syntactic parsers
–medical document collections (for classification)

References
MedIEQ:
–
–
Related projects:
–WRAPIN
–Quatro
–CROSSMARC
Relaxed:
–
Ex:
–
Ellogon:
–

Questions?