University of Economics Prague Ontology-based information extraction: progress and perspectives of the Ex tool Martin Labský KEG seminar, May 29, 2008

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Agenda 1. Motivation for Web Information Extraction (IE) 2. Difficulties in practical applications 3. Extraction Ontologies 4. Extraction process 5. Experimental results: contact information 6. Future work and Conclusion

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Motivation for Web IE (1/4): online products

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Motivation for Web IE (2/4): contact information

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Motivation for Web IE (3/4): seminars, events

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Motivation for Web IE (4/4)  Store the extracted results in a DB to enable structured search over documents –information retrieval –database-like querying –e.g. online product search engine –e.g. building a contact DB  Support for web page quality assessment –involved in the EU project MedIEQ to support medical website accreditation agencies  Source documents –internet, intranet, e-mails –can be very diverse

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Agenda 1. Motivation for Web Information Extraction (IE) 2. Difficulties in practical IE applications 3. Extraction Ontologies 4. Extraction process 5. Experimental results: contact information 6. Future work and Conclusion

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Difficulties in practical applications (1/3)  Requirements –be able to extract some information quickly, not necessarily with the best accuracy; this is often needed for a proof-of-concept application, and more work can then be done to boost accuracy –the extraction model changes over time: the meaning of to-be-extracted items may shift, and new items are often added

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Difficulties in practical applications (2/3)  Training data –most state-of-the-art trainable IE systems require large amounts of training data, which are almost never available –once training data have been collected, they are not easy to adapt to changed or additional criteria –active learning helps reduce training data collection efforts, but users often need to spend time annotating trivial examples that could easily be covered by manual rules –this is our experience from experiments with extraction of bicycle descriptions using Hidden Markov Models  Wrappers –wrapper-only systems cannot be relied on when extracting from multiple websites –non-wrapper systems often do not utilize regular formatting cues  Purely manual rules –extraction rules written purely by hand are not easily extensible once training data are collected in later phases

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Difficulties in practical applications (3/3)  It seems to be difficult to exploit at the same time –extraction knowledge from domain experts –training data –formatting regularities, both within a document and within a group of documents from the same source  We attempt to address this with the approach of extraction ontologies

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Agenda 1. Motivation for Web Information Extraction (IE) 2. Difficulties in practical applications 3. Extraction Ontologies 4. Extraction process 5. Experimental results: contact information 6. Future work and Conclusion

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Extraction ontologies  An extraction ontology is a part of a domain ontology transformed to suit extraction needs  Contains classes composed of attributes –more like UML class diagrams, less like ontologies where e.g. relations are standalone –also contains axioms related to classes or attributes  Classes and attributes are augmented with extraction evidence –manually provided patterns for content and context –value or length ranges –links to external trainable classifiers [Example class diagram: Person with attributes name {1}, degree {0-5}, e-mail {0-2} and phone {0-3}, and a Responsible association]
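The slide describes extraction ontologies only in prose and with the class diagram above; the sketch below is a minimal, hypothetical Python rendering of the same structure (a class composed of attributes carrying cardinalities, content/context patterns, value constraints and class-level axioms). All names, patterns and the representation itself are illustrative assumptions, not the actual Ex ontology format.

```python
# Illustrative only: a toy in-memory model of an extraction ontology class.
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str
    cardinality: tuple                                      # (min, max) occurrences per instance
    value_patterns: list = field(default_factory=list)      # regexes over the attribute value
    context_patterns: list = field(default_factory=list)    # regexes over the surrounding text
    length_range: tuple = (1, 5)                             # allowed value length in tokens

@dataclass
class ExtractionClass:
    name: str
    attributes: list
    axioms: list = field(default_factory=list)               # callables over a candidate instance

person = ExtractionClass(
    name="Person",
    attributes=[
        Attribute("name", (1, 1), value_patterns=[r"[A-Z][a-z]+ [A-Z][a-z]+"]),
        Attribute("degree", (0, 5), value_patterns=[r"(Dr|PhD|MSc|Ing)\.?"]),
        Attribute("phone", (0, 3), value_patterns=[r"\+?\d[\d\s-]{6,}\d"],
                  context_patterns=[r"(?i)phone|tel"]),
    ],
    axioms=[lambda inst: len(inst.get("phone", [])) <= 3],   # a trivial class-level axiom
)
```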

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Extraction evidence provided by domain expert (1)  Patterns –for attributes and classes –for their content and context –patterns may be defined at the following levels: word and character level, formatting tag level, and the level of labels (e.g. sentence breaks, POS tags)  Attribute value constraints –word length constraints, numeric value ranges –possible to attach units to numeric attributes  Axioms –may enforce relations among attributes –interpreted using the JavaScript scripting language  Simple co-reference resolution rules

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Extraction evidence provided by domain expert (2) [Annotated ontology example showing the evidence types: axioms (class level, attribute level), patterns (class content, attribute value, attribute context, class context), value constraints (word length, numeric value)]

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Extraction evidence from classifiers  Links to trainable classifiers –may classify attributes only –binary or multi-class  Training (if not done externally) uses these features –re-use all evidence provided by the expert –induce binary features based on word n-grams  Feature induction –candidate features are all word n-grams of given lengths occurring inside or near training attribute values –pruning parameters: point-wise mutual information thresholds, minimal absolute occurrence count, maximum number of features [slide also shows ontology snippets for classifier definition and classifier usage]
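A minimal sketch of the n-gram feature induction just described, under the assumption that pruning means keeping n-grams whose pointwise mutual information with the attribute label, occurrence count and rank all pass the given thresholds; the function and parameter names are hypothetical, not Ex configuration keys.

```python
# Sketch only: induce word n-gram features near attribute values, prune by PMI,
# minimum occurrence count and maximum feature count.
from collections import Counter
from math import log

def induce_ngram_features(token_windows, labels, n=2,
                          min_count=3, pmi_threshold=1.0, max_features=200):
    """token_windows: token lists around candidate values; labels: 1 if the window
    contains a true attribute value, else 0."""
    ngram_counts, ngram_pos = Counter(), Counter()
    positives = sum(labels)
    for tokens, y in zip(token_windows, labels):
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        for g in grams:
            ngram_counts[g] += 1
            ngram_pos[g] += y
    total = len(token_windows)
    scored = []
    for g, c in ngram_counts.items():
        if c < min_count or positives == 0:
            continue
        # PMI between "n-gram occurs in window" and "window holds a true attribute value"
        pmi = log((ngram_pos[g] / total + 1e-12) / ((c / total) * (positives / total)))
        if pmi >= pmi_threshold:
            scored.append((pmi, g))
    return [g for _, g in sorted(scored, reverse=True)[:max_features]]
```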

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Probabilistic model to combine evidence  Each piece of evidence E is equipped with 2 probability estimates with respect to the predicted attribute A: –evidence precision P(A|E)... prediction confidence –evidence coverage P(E|A)... necessity of the evidence (support)  Each attribute is assigned some low prior probability P(A)  Let E1, ..., En be the set of evidence applicable to A  Assume conditional independence among E1, ..., En given A  Using Bayes' formula, P(A | E1, ..., En) is then computed from the prior, the precisions and the coverages of the individual pieces of evidence
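Since the slide invokes Bayes' formula under a conditional-independence assumption, a standard way to write that combination is sketched below; whether Ex uses exactly this normalisation is an assumption here.

```latex
P(A \mid E_1,\dots,E_n)
  = \frac{P(A)\prod_{i=1}^{n} P(E_i \mid A)}
         {P(A)\prod_{i=1}^{n} P(E_i \mid A) \;+\; P(\neg A)\prod_{i=1}^{n} P(E_i \mid \neg A)}
```

Here P(E_i | A) is the evidence coverage from the slide; P(E_i | ¬A) can be derived from the stated precision P(A | E_i) and the prior P(A), since P(E_i) = P(E_i | A) P(A) / P(A | E_i) and P(E_i | ¬A) = (P(E_i) − P(E_i | A) P(A)) / (1 − P(A)).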

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Agenda 1. Motivation for Web Information Extraction (IE) 2. Difficulties in practical applications 3. Extraction Ontologies 4. Extraction process 5. Experimental results: contact information 6. Future work and Conclusion

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE The extraction process (1/5) 1. Tokenize, build the HTML formatting tree, apply a sentence splitter and a POS tagger 2. Match patterns 3. Create Attribute Candidates (ACs)  for each created AC, let P_AC be its probability according to the evidence combination model above  prune ACs below a threshold  build a document AC lattice, scoring ACs by log(P_AC) [illustrated on the slide with an example AC lattice over the phrase "Washington, DC..."]
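A small illustrative sketch of step 3, assuming ACs are produced by running the ontology's value patterns over the text and pruning low-probability matches; the data structures, the score_fn hook and the threshold are hypothetical stand-ins for the real evidence model.

```python
# Sketch only: create and prune Attribute Candidates (ACs) from pattern matches.
from math import log

def create_attribute_candidates(tokens, attribute_patterns, score_fn, threshold=0.3):
    """attribute_patterns: {attr_name: [compiled regex, ...]};
    score_fn(attr, matched_text) stands in for P_AC from the evidence model."""
    text = " ".join(tokens)
    acs = []
    for attr, patterns in attribute_patterns.items():
        for pat in patterns:
            for m in pat.finditer(text):
                p_ac = score_fn(attr, m.group(0))
                if p_ac >= threshold:                      # prune ACs below the threshold
                    acs.append({"attr": attr, "start": m.start(), "end": m.end(),
                                "value": m.group(0),
                                "score": log(p_ac)})       # lattice arcs scored by log(P_AC)
    return acs
```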

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE The extraction process (2/5) 4. Evaluate coreference resolution rules for each pair of ACs  e.g. "Dr. Burns" and "John Burns" may corefer  possible coreferring groups are remembered  coreference rules are given in the attribute's value section of the ontology 5. Compute the best scoring path BP through the AC lattice  using dynamic programming 6. Run the wrapper induction algorithm using all ACs on BP  the wrapper induction algorithm is described on the next slides  if new local patterns are induced, apply them to rescore existing ACs and create new ACs, then update the AC lattice and recompute BP 7. Terminate here if no instances are to be generated  output all ACs on BP (n-best paths supported)
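A sketch of the dynamic-programming best-path computation in step 5, under the simplifying assumption that the lattice reduces to picking a maximum-score set of non-overlapping candidates (weighted interval scheduling); the real Ex lattice also carries background arcs and, later, instance candidates.

```python
# Sketch only: best-scoring set of non-overlapping candidates via dynamic programming.
# Assumes candidate scores are rewards (e.g. log-odds against background), so that
# negative-score candidates are simply left out of the path.
from bisect import bisect_right

def best_path(candidates):
    """candidates: dicts with 'start', 'end' (exclusive) and 'score'."""
    cands = sorted(candidates, key=lambda c: c["end"])
    ends = [c["end"] for c in cands]
    best = [0.0] * (len(cands) + 1)        # best[i]: best score using the first i candidates
    choice = [None] * (len(cands) + 1)
    for i, c in enumerate(cands, start=1):
        j = bisect_right(ends, c["start"], 0, i - 1)   # last compatible candidate prefix
        take = best[j] + c["score"]
        if take > best[i - 1]:
            best[i], choice[i] = take, (i - 1, j)
        else:
            best[i], choice[i] = best[i - 1], None
    path, i = [], len(cands)               # backtrack the chosen candidates
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            idx, j = choice[i]
            path.append(cands[idx])
            i = j
    return list(reversed(path)), best[len(cands)]
```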

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE The extraction process (3/5) 8. Generate Instance Candidates (ICs) bottom-up –a triangular trellis is used to store partial ICs –when scoring new ICs, only axioms and patterns that can already be applied to the partial IC are considered; validity is not yet required –pruning parameters: absolute and relative beam size at each trellis node, maximum number of ACs that can be skipped, minimum IC probability
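An illustrative sketch of the bottom-up IC generation with beam pruning described in step 8; it implements only an absolute beam and an AC-skip limit, and score_partial is a placeholder for the axiom/pattern scoring of partial ICs, so all names and defaults are assumptions.

```python
# Sketch only: grow partial Instance Candidates (ICs) over the AC sequence with beam pruning.
def generate_instance_candidates(acs, score_partial, beam_size=10, min_prob=0.05, max_skip=2):
    """acs: attribute candidates in document order; returns surviving partial ICs."""
    beam = [{"members": [], "skipped": 0, "prob": 1.0}]
    for ac in acs:
        extended = []
        for ic in beam:
            # option 1: skip this AC (treat it as not belonging to the instance)
            if ic["skipped"] < max_skip:
                extended.append({**ic, "skipped": ic["skipped"] + 1})
            # option 2: add the AC to the partial IC and rescore it
            members = ic["members"] + [ac]
            prob = score_partial(members)
            if prob >= min_prob:                          # minimum IC probability
                extended.append({"members": members, "skipped": ic["skipped"], "prob": prob})
        # beam pruning: keep only the best partial ICs at this position
        beam = sorted(extended, key=lambda x: x["prob"], reverse=True)[:beam_size]
    return [ic for ic in beam if ic["members"]]
```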

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE The extraction process (4/5) 8. IC generation (continued)  When a new IC is created, its P(IC) is computed from 2 components  The first component is defined in terms of |IC|, the member attribute count, the AC_skips (non-member ACs that lie fully or partially inside the IC) and P(AC_skip), the probability of such an AC being a "false positive"  The second component is based on E_C, the set of evidence known for the class C, and is computed using the same probabilistic model as for ACs  The two scores are combined using the Prospector pseudo-Bayesian method
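For reference, the Prospector pseudo-Bayesian combination named above is conventionally stated in odds form; assuming that standard form is what Ex applies to its two component scores, with prior P_0 and component estimates P_k:

```latex
O(x) = \frac{x}{1-x}, \qquad
O_{\mathrm{comb}} = O(P_0)\prod_{k}\frac{O(P_k)}{O(P_0)}, \qquad
P_{\mathrm{comb}} = \frac{O_{\mathrm{comb}}}{1+O_{\mathrm{comb}}}
```

Here the P_k would be the AC-based and the class-evidence-based components, and P_0 the class prior.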

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE The extraction process (5/5) 9. Insert valid ICs into the AC lattice  valid ICs were assembled during the IC generation phase  the score of a valid IC reflects all extraction evidence of its class  all unpruned valid ICs are inserted into the AC lattice, scored by log P(IC) 10. The best path BP is calculated through the IC+AC lattice (n-best supported)  the search algorithm allows constraints to be defined over the extracted path(s)  e.g. a min/max count of extracted instances  output all ACs and ICs on BP [illustrated on the slide with a lattice containing an instance candidate IC1]

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Extraction evidence based on formatting  A simple wrapper induction algorithm –identify formatting regularities –turn them into "local" context patterns to boost contained ACs 1. Assemble distinct formatting subtrees rooted at block elements containing ACs from the best path BP currently determined by the system 2. For each subtree S, calculate C(S,Att) and prec(Att|S) 3. If both C(S,Att) and prec(Att|S) reach defined thresholds, a new local context pattern is created with its precision set to C(S,Att) and its recall close to 0 (in order not to harm potential singleton ACs) [Example: formatting subtrees TD > A_href > B over "John" and "Argentina"; a formatting tree learned using known names like "John Doe" and applied to unknown names]
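A sketch of how such local formatting patterns could be induced, assuming C(S,Att) is an occurrence count and prec(Att|S) the fraction of occurrences of subtree S that contain an attribute Att on the best path; the signature representation and thresholds are illustrative assumptions.

```python
# Sketch only: promote frequent, precise formatting-subtree signatures into local context patterns.
from collections import Counter, defaultdict

def induce_local_patterns(blocks, min_count=3, min_precision=0.8):
    """blocks: (subtree_signature, attr_or_None) pairs, one per block element; attr_or_None
    is the attribute of a best-path AC contained in that block, if any."""
    per_sig_total = Counter(sig for sig, _ in blocks)
    per_sig_attr = defaultdict(Counter)
    for sig, attr in blocks:
        if attr is not None:
            per_sig_attr[sig][attr] += 1
    patterns = []
    for sig, attr_counts in per_sig_attr.items():
        for attr, count in attr_counts.items():           # count plays the role of C(S, Att)
            precision = count / per_sig_total[sig]         # plays the role of prec(Att | S)
            if count >= min_count and precision >= min_precision:
                # high-precision local pattern with near-zero recall, so that singleton ACs
                # outside the learned formatting are not penalised
                patterns.append({"subtree": sig, "attr": attr,
                                 "precision": precision, "recall": 0.01})
    return patterns
```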

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Agenda 1. Motivation for Web Information Extraction (IE) 2. Difficulties in practical applications 3. Extraction Ontologies 4. Extraction process 5. Experimental results: contact information 6. Future work and Conclusion

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Experimental results: Contact information  109 English contact pages, 200 Spanish, 108 Czech  Named entity counts: 7000, 5000 and 11000, respectively; instances were not labeled  Only the domain expert's evidence and formatting pattern induction were used  The domain expert saw 30 randomly chosen documents; the rest was used as test data  Instance extraction was performed but not evaluated

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Future work  Confirm that improved results can be achieved when combining expert knowledge and formatting pattern induction with classifiers  Attempt to improve a seed extraction ontology by bootstrapping using relevant pages retrieved from the Internet  Adapt the structure of extraction ontology according to data –e.g. add new attributes to represent product features

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Conclusions  Presented an extraction ontology approach to –allow for fast prototyping of IE applications –accommodate extraction schema changes easily –utilize all available forms of extraction knowledge: the domain expert's knowledge, training data, and formatting regularities found in web pages  Results –indicate that extraction ontologies can serve as a quick prototyping tool –it seems possible to improve the performance of the prototyped ontology when training data become available

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Acknowledgements  The research was partially supported by the EC under contract FP, Knowledge Space of Semantic Inference for Automatic Annotation and Retrieval of Multimedia Content (K-Space).  The medical website application is carried out in the context of the EC-funded (DG-SANCO) project MedIEQ.

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Backup slides  IET and co.

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Information extraction toolkit – architecture [Architecture diagram of the INFORMATION EXTRACTION TOOLKIT, divided into user components and admin components; it shows the IE engines (Ex extraction ontology engine, CRF extraction engine, rule-based integrator (TBD)), labelling schemas, classified documents from WCC, the Data Model Manager, Pre-processor, Task Manager UI, Document IO, Annotation tool UI and AQUA Evaluator; inputs are the expert's domain and extraction knowledge and annotated corpora, outputs are annotated documents and extracted attributes and instances]

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Information extraction toolkit – document flow [Flow diagram: a classified document is pre-processed; extraction model(s) are selected based on the document class; the Extraction ontology engine and the CRF NE engine extract attributes and instances; the Rule-based integrator refines the extracted values, e.g. based on document classification; the output is the extracted attributes and instances]

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Czech contact data set: results [Results table with columns: counts (gold, auto), strict mode (precision, recall, F) and loose mode (precision, recall, F); rows: title, name, street, city, region, zip, country, phone, organization, department, overall]

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Czech dataset: per-attribute F-measures  IET purpose:  to support the user by providing suggestions  not to work standalone without supervision

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Customization to new criteria  Precisely define the criterion or criteria group –define and give positive and negative examples  If gazetteers are required: –search for or construct appropriate gazetteers  If training is required: –annotate a training corpus of at least 100 documents with at least 300 occurrences of the criterion –train one of the trainable extractors: the CRF engine, or Ex with Weka integration  If some extraction evidence can be given by a human: –write a new extraction ontology or extend an existing one  Evaluate performance

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Localization to a new language  Reuse language-independent parts of the extraction ontology: –class structure (attributes in a class) –cardinalities, constraints, axioms –some criteria can be reused almost completely (phone, e-mail)  If a criterion requires training: –annotate a corpus and train a classifier as when adding a new criterion  Provide language-specific extraction evidence that can be encoded by a human (if any): –add it to the extraction ontology

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE Demo + tutorial  IET + Ex –free text criteria –(shows internal IET user interface)  Tutorial –

ISMIS 2008 Combining Multiple Sources of Evidence in Web IE New features in Ex IE engine  Significant speed-up  Memory footprint reduction  Multiple class extraction  Extended axiom support  Instance parsing and reference resolution improvements  Extraction ontology authoring made easier