Machine Learning for Information Integration on the Web Georgios Paliouras Software & Knowledge Engineering Lab Inst. of Informatics & Telecommunications.

Presentation transcript:

Machine Learning for Information Integration on the Web Georgios Paliouras Software & Knowledge Engineering Lab Inst. of Informatics & Telecommunications NCSR Demokritos Dagstuhl, February 15, 2005

Dagstuhl 15/2/2005 Machine Learning for Information Integration 2
SKEL Introduction
Areas of research activity:
–Information gathering (retrieval, crawling, spidering)
–Information filtering (text and multimedia classification)
–Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
–Personalization (user stereotypes and communities)
SKEL's research objective: innovative knowledge technologies for reducing the information overload on the Web

Structure of the talk
Web Information integration in CROSSMARC
Learning Context Free Grammars
Meta-learning for Web Information Extraction
Machine Learning for Ontology Maintenance
Conclusions

CROSSMARC consortium
National Centre for Scientific Research "Demokritos" (GR)
University of Edinburgh (UK)
Università di Roma Tor Vergata (IT)
VeltiNet A.E. (GR)
Lingway (FR)

CROSSMARC Objectives
Develop technology for Information Integration that can:
–crawl the Web for interesting Web pages,
–extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
–process Web pages written in several languages,
–be customized semi-automatically to new domains and languages,
–deliver integrated information according to personalized profiles.

CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; surviving label: Ontology]

CROSSMARC Ontology
[Figure: ontology fragment with its lexicons]
Ontology: … Laptops, Processor, Processor Name, Intel Pentium 3 …
Lexicon: Intel Pentium III, Pentium III, P3, PIII
Greek Lexicon: Όνομα Επεξεργαστή (Processor Name)

Learning Context Free Grammars: Introducing eg-GRIDS
Infers context-free grammars.
Learns from positive examples only.
Overgeneralisation controlled through a heuristic based on MDL.
Two basic and three auxiliary learning operators.
Two search strategies:
–Beam search.
–Genetic search.

Learning Context Free Grammars: Minimum Description Length (MDL)
Model Length (ML) = GDL + DDL
Grammar Description Length (GDL): bits required to encode the grammar G.
Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.
[Figure: GDL and DDL plotted over the hypotheses, from the Overly Specific Grammar to the Overly General Grammar]
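
The trade-off between GDL and DDL can be illustrated with a toy encoding. The sketch below is not the encoding used by eg-GRIDS; it assumes that every symbol of a rule costs the same number of bits and that every rule choice in a derivation costs log2 of the number of alternatives for that non-terminal. The two grammars, the function names and the hand-written derivations are all invented for illustration.

```python
import math

def gdl(grammar):
    """Toy Grammar Description Length: bits needed to write down all rules."""
    # Vocabulary: non-terminal heads plus every symbol used in a rule body.
    symbols = set(grammar)
    for bodies in grammar.values():
        for body in bodies:
            symbols.update(body)
    bits_per_symbol = math.log2(len(symbols) + 1)  # +1 for an end-of-rule marker
    # Each rule is written as: head, body symbols, end-of-rule marker.
    n_symbols = sum(len(body) + 2 for bodies in grammar.values() for body in bodies)
    return n_symbols * bits_per_symbol

def ddl(grammar, derivations):
    """Toy Derivations Description Length: each expansion of a non-terminal
    costs log2(number of alternative rules for that non-terminal)."""
    return sum(math.log2(len(grammar[head])) for d in derivations for head in d)

def model_length(grammar, derivations):
    return gdl(grammar) + ddl(grammar, derivations)

# Training examples (assumed): "a b", "a b b", "a b b b".
# Overly specific grammar: one rule per example.
specific = {"S": [("a", "b"), ("a", "b", "b"), ("a", "b", "b", "b")]}
specific_derivs = [["S"], ["S"], ["S"]]

# A more general grammar: S -> a B ; B -> b | b B.
general = {"S": [("a", "B")], "B": [("b",), ("b", "B")]}
general_derivs = [["S", "B"], ["S", "B", "B"], ["S", "B", "B", "B"]]

for name, g, d in [("specific", specific, specific_derivs),
                   ("general", general, general_derivs)]:
    print(name, round(gdl(g), 2), round(ddl(g, d), 2), round(model_length(g, d), 2))
```

Under this toy encoding the specific grammar has the larger GDL and the smaller DDL, and the more general grammar has the smaller sum, which is the behaviour an MDL heuristic is meant to reward.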

Learning Context Free Grammars: eg-GRIDS Architecture
[Figure: eg-GRIDS architecture flowchart]
Flow: Training Examples, Overly Specific Grammar, search, Final Grammar.
Search Organisation: Beam of Grammars; Evolutionary Algorithm (Selection, Mutation).
Operator Mode.
Learning Operators: Merge NT Operator, Create NT Operator, Create Optional NT, Detect Center Embedding, Body Substitution.
Decision (YES/NO): Any inferred grammar better than those in beam?
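
The beam-search branch of the flowchart can be sketched generically. This is not the eg-GRIDS implementation: the grammars are treated as opaque hashable states, `operators` stands in for the learning operators, `score` stands in for the model length (lower is better), and the stopping test "better than those in beam" is read here as "better than the worst grammar currently in the beam". All names are hypothetical.

```python
def beam_search(initial, operators, score, beam_width=3):
    """Generic beam search; states must be hashable, lower score is better."""
    beam = [initial]
    while True:
        # Apply every operator to every state in the beam.
        candidates = {c for state in beam for op in operators for c in op(state)}
        worst = max(score(s) for s in beam)
        better = [c for c in candidates if c not in beam and score(c) < worst]
        if not better:
            # No inferred state improves on the beam: return the best one found.
            return min(beam, key=score)
        # Keep the beam_width best states among the old beam and the new ones.
        beam = sorted(set(beam) | set(better), key=score)[:beam_width]

# Toy usage with integers instead of grammars: reach the value 10.
best = beam_search(
    0,
    operators=[lambda x: [x + 1], lambda x: [x + 3]],
    score=lambda x: abs(x - 10),
    beam_width=2,
)
print(best)
```

In a grammar-learning setting the initial state would be the overly specific grammar and each operator would return the grammars obtained by one application of, for example, a merge of two non-terminals.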

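Meta-learning for Web IE: Stacked generalization
[Figure: stacked generalization]
Training: for each fold j, the learners L_1 … L_N are trained on D \ D_j (the base-level dataset D without the fold D_j), giving classifiers C_1(j) … C_N(j). Applying these to D_j gives MD_j, and the MD_j together form the meta-level dataset MD. The meta-level learner L_M is trained on MD, giving the meta-level classifier C_M.
Run-time: a new vector x is classified by C_1 … C_N (trained by L_1 … L_N on D), which gives the meta-level vector; C_M maps it to the class value y(x).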
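
A minimal sketch of stacked generalization follows, assuming two deliberately simple base-level learners and a lookup-table meta-level learner. None of these learners come from the talk; they are placeholders chosen so that the example runs with the standard library only, and the one-dimensional toy dataset is invented.

```python
from collections import Counter, defaultdict

def majority_learner(train):
    """Base learner: always predicts the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbour_learner(train):
    """Base learner: predicts the label of the closest training point."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]

def table_learner(train):
    """Meta-level learner L_M: majority class per meta-level vector."""
    votes = defaultdict(Counter)
    for vec, y in train:
        votes[vec][y] += 1
    default = Counter(y for _, y in train).most_common(1)[0][0]
    def predict(vec):
        return votes[vec].most_common(1)[0][0] if vec in votes else default
    return predict

def stack(data, base_learners, meta_learner, n_folds=3):
    folds = [data[j::n_folds] for j in range(n_folds)]
    meta_data = []
    for j, held_out in enumerate(folds):
        rest = [ex for k, fold in enumerate(folds) if k != j for ex in fold]
        classifiers = [learn(rest) for learn in base_learners]  # C_1(j) … C_N(j)
        for x, y in held_out:                                   # builds MD_j
            meta_data.append((tuple(c(x) for c in classifiers), y))
    meta_classifier = meta_learner(meta_data)                   # C_M
    final = [learn(data) for learn in base_learners]            # C_1 … C_N
    return lambda x: meta_classifier(tuple(c(x) for c in final))

# Toy base-level dataset D: numbers labelled by a threshold.
data = [(x, "low" if x < 6 else "high") for x in range(14)]
stacked = stack(data, [majority_learner, nearest_neighbour_learner], table_learner)
print(stacked(2), stacked(9))
```

The meta-level classifier never sees the raw input, only the vector of base-level predictions, so it can learn which base classifier to trust for which combination of outputs.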

Meta-learning for Web IE
Information Extraction is not naturally a classification task.
In IE we deal with text documents, paired with templates.
Each template is filled with instances.
Document: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
Template T:
t(s, e) | s, e | Field f
Transport ZX | 47, 49 | model
15 | 56, 58 | screenSize
TFT | 59, 60 | screenType
Intel Pentium III | 63, 67 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram

Meta-learning for Web IE: Combining Information Extraction systems
Document: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
T_1, filled by the IE system E_1:
t(s, e) | s, e | f
Transport ZX | 47, 49 | model
15 | 56, 58 | screenSize
TFT | 59, 60 | screenType
Intel Pentium III | 63, 67 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram
1 GB | 81, 83 | ram
T_2, filled by the IE system E_2:
t(s, e) | s, e | f
Transport ZX | 47, 49 | manuf
TFT | 59, 60 | screenType
Intel Pentium | 63, 66 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram
1 GB | 81, 83 | HDcapacity

Meta-learning for Web IE: Creating a stacked template
Document: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
Stacked template (ST):
s, e | t(s, e) | Field by E_1 | Field by E_2 | Correct field
47, 49 | Transport ZX | model | manuf | model
56, 58 | 15 | screenSize | - | screenSize
59, 60 | TFT | screenType | screenType | screenType
63, 66 | Intel Pentium | - | procName | -
63, 67 | Intel Pentium III | procName | - | procName
67, 69 | 600 MHz | procSpeed | procSpeed | procSpeed
76, 78 | 256 MB | ram | ram | ram
81, 83 | 1 GB | ram | HDcapacity | -
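
The construction of a stacked template can be sketched as a merge over text spans. The data below are the templates from the slides; the dictionary representation, the function name `stack_templates` and the use of "-" for "no field proposed" are assumptions made for this illustration.

```python
def stack_templates(system_templates, hand_filled=None):
    """Merge templates {(s, e): (text, field)} into stacked-template rows.

    Each row is (span, text, field by E_1, ..., field by E_N[, correct field]);
    "-" marks a system (or the hand-filled template) with no entry for the span.
    """
    spans = sorted(set().union(*system_templates))
    rows = []
    for span in spans:
        text = next(t[span][0] for t in system_templates if span in t)
        fields = [t[span][1] if span in t else "-" for t in system_templates]
        row = [span, text, *fields]
        if hand_filled is not None:  # only available at training time
            row.append(hand_filled[span][1] if span in hand_filled else "-")
        rows.append(tuple(row))
    return rows

T1 = {(47, 49): ("Transport ZX", "model"), (56, 58): ("15", "screenSize"),
      (59, 60): ("TFT", "screenType"), (63, 67): ("Intel Pentium III", "procName"),
      (67, 69): ("600 MHz", "procSpeed"), (76, 78): ("256 MB", "ram"),
      (81, 83): ("1 GB", "ram")}
T2 = {(47, 49): ("Transport ZX", "manuf"), (59, 60): ("TFT", "screenType"),
      (63, 66): ("Intel Pentium", "procName"), (67, 69): ("600 MHz", "procSpeed"),
      (76, 78): ("256 MB", "ram"), (81, 83): ("1 GB", "HDcapacity")}
T = {(47, 49): ("Transport ZX", "model"), (56, 58): ("15", "screenSize"),
     (59, 60): ("TFT", "screenType"), (63, 67): ("Intel Pentium III", "procName"),
     (67, 69): ("600 MHz", "procSpeed"), (76, 78): ("256 MB", "ram")}

rows = stack_templates([T1, T2], hand_filled=T)
for row in rows:
    print(row)
```

Each row then plays the role of one meta-level example: the fields proposed by the systems are the features and the correct field is the class.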

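Meta-learning for Web IE: Training in the new stacking framework
[Figure: training in the new stacking framework]
D = set of documents, paired with hand-filled templates.
MD = set of meta-level feature vectors.
For each fold j, the learners L_1 … L_N are trained on D \ D_j, giving IE systems E_1(j) … E_N(j). Their templates for the documents in D_j are combined into stacked templates ST_1, ST_2, …, which give MD_j.
L_M is trained on MD, giving C_M. L_1 … L_N are trained on D, giving E_1 … E_N.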

Meta-learning for Web IE: Stacking at run-time
[Figure: stacking at run-time]
New document d, IE systems E_1, E_2, …, E_N, templates T_1, T_2, …, T_N, stacked template, C_M, final template T.
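
One possible reading of the run-time step is sketched below. The meta-level classifier is assumed to be a simple lookup table over the fields proposed by two systems, trained on the rows of the stacked template shown earlier; the actual meta-level features and learner are not specified on the slides. The new-document rows (`Acme X1` and so on) are invented for illustration.

```python
from collections import Counter, defaultdict

def train_meta_classifier(training_rows):
    """training_rows: ((field by E_1, field by E_2), correct field) pairs."""
    votes = defaultdict(Counter)
    for fields, correct in training_rows:
        votes[fields][correct] += 1
    table = {fields: c.most_common(1)[0][0] for fields, c in votes.items()}
    # Unseen combinations are rejected ("-") in this sketch.
    return lambda fields: table.get(fields, "-")

def final_template(stacked_template, meta_classifier):
    """Keep every span for which the meta-level classifier accepts a field."""
    final = {}
    for span, text, *fields in stacked_template:
        field = meta_classifier(tuple(fields))
        if field != "-":
            final[span] = (text, field)
    return final

# Meta-level training data taken from the stacked template in the slides.
training_rows = [
    (("model", "manuf"), "model"), (("screenSize", "-"), "screenSize"),
    (("screenType", "screenType"), "screenType"), (("-", "procName"), "-"),
    (("procName", "-"), "procName"), (("procSpeed", "procSpeed"), "procSpeed"),
    (("ram", "ram"), "ram"), (("ram", "HDcapacity"), "-"),
]
C_M = train_meta_classifier(training_rows)

# Hypothetical stacked template for a new document d.
new_stacked = [
    ((10, 12), "Acme X1", "model", "manuf"),
    ((20, 22), "512 MB", "ram", "ram"),
    ((30, 32), "2 GB", "ram", "HDcapacity"),
]
final = final_template(new_stacked, C_M)
print(final)
```

The span on which the two systems disagree in the way that was wrong during training (`ram` versus `HDcapacity`) is dropped from the final template, while the `model` versus `manuf` disagreement is resolved in favour of the first system.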

Ontology Enrichment
The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.
Highly evolving domain (e.g. laptop descriptions):
–New instances characterize new concepts, e.g. "Pentium 2" is an instance that denotes a new concept if it doesn't exist in the ontology.
–New surface appearance of an instance, e.g. "PIII" is a different surface appearance of "Intel Pentium 3".
We concentrate on instances.

Ontology Enrichment
[Figure: ontology enrichment cycle]
Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; Information extraction (machine learning); Additional annotations; Validation (Domain Expert); Ontology Enrichment / Population.

Enrichment with synonyms
The number of instances for validation increases with the size of the corpus and the ontology.
There is a need for supporting the enrichment of the synonymy relationship.
Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
Issues to be handled:
–Synonym: Intel pentium 3 - Intel pIII
–Orthographical: Intel p3 - intell p3
–Lexicographical: Hewlett Packard - HP
–Combination: Intell Pentium 3 - P III

Compression-based Clustering
COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added.
Huffman trees are used as models of the clusters.
COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search.
The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
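
The description above can be turned into a small sketch. It is one plausible reading, not the published COCLU code: the code length of a cluster is taken to be the number of bits needed to Huffman-encode the characters of all its strings, CCDiff is the increase in that length when the candidate is added and the Huffman tree is rebuilt, and the threshold value used in the example is arbitrary.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} of a Huffman code for freqs."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tie = count()  # tie-breaker so the heap never compares the dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes all their symbols one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def code_length(strings):
    """Bits needed to encode all characters of the cluster with its Huffman code."""
    freqs = Counter("".join(strings))
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)

def cc_diff(cluster, candidate):
    """Increase in cluster code length caused by adding the candidate."""
    return code_length(cluster + [candidate]) - code_length(cluster)

def coclu(strings, threshold):
    """Greedy clustering: closest cluster if CCDiff <= threshold, else new cluster."""
    clusters = []
    for s in strings:
        if clusters:
            best = min(clusters, key=lambda c: cc_diff(c, s))
            if cc_diff(best, s) <= threshold:
                best.append(s)
                continue
        clusters.append([s])
    return clusters

clusters = coclu(["intel p3", "intell p3", "hewlett packard"], threshold=30)
print(clusters)
```

Strings that reuse the characters of a cluster, such as the orthographic variant `intell p3`, add few bits, whereas a string over mostly different characters adds many and therefore starts a new cluster.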

Conclusions
Information integration can benefit from machine learning.
Grammar learning methods have become efficient.
Combining IE systems improves performance.
Ontologies can be used to annotate examples to learn IE systems and enrich ontologies.
Grammar learning in parallel or in combination with ontology learning?

Acknowledgements
This is research of many current and past members of SKEL.
CROSSMARC is joint work of the project consortium.

Announcement: IJCAI workshop
Workshop on Grammatical Inference Applications: Successes and Future Challenges
IJCAI-05, Edinburgh, Scotland, July 31, 2005
Paper submission deadline: March 19, 2005
URL: