The Semantic Web - Week 21
Building the SW: Information Extraction and Integration
Module Website:
Practical this week:

Recall the slide from Week 18 on Content Acquisition:
- The semantic web needs populating with content – how can this be done, given that people in general don't understand description logic / FOL?
- There are two types of content:
  - A. NEW knowledge
  - B. OLD information in existing, structured formats
- We concentrated on A; most content, initially at least, will come through B – we will return to this later…
- In fact, we can also acquire semantic web content from OLD information in existing semi-structured or unstructured formats, e.g. HTML pages.

Information Extraction
- Information extraction is the process of extracting "meaningful" data from raw or semi-structured text.
- Two extremes:
  - "Natural Language Understanding" – take raw (English) text and turn it into some logic representing its meaning.
    - In SW terms: raw text => OWL
  - "Feature Extraction" – extract a particular piece of data from a semi-structured or unstructured document, e.g. extract an address from a standard web page.
    - In SW terms: HTML => XML

Information Extraction

Example: you're on eBay and you want a toilet cistern and wash basin that have a combined width of under 90cm.
"Solution": waste all Sunday afternoon going through 673 entries for "toilet" looking for widths and cross-checking with 923 entries for "wash basin"!

The Web's HTML content makes it difficult to retrieve and integrate data from multiple sources. Information Agents are capable of retrieving info from some web sites via database-like queries (such as the one required in the example above). The Agent uses a wrapper to extract the information from a collection of similar-looking Web pages.

wrapper ~ grammar of the data in the web site + code to utilise the grammar

This is equivalent to turning HTML => XML + DTDs!
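To make the wrapper idea concrete, here is a minimal sketch in Python. The HTML fragments, field names and regular expressions are all hypothetical – a real eBay page would need its own "grammar" – but it shows the pattern: a grammar (here, regexes) plus code that applies it to similar-looking pages.

```python
import re

# Hypothetical listing snippets in the style of a results page.
# A real wrapper would be built for (or learned from) actual pages.
PAGES = [
    '<li class="item"><b>Compact cistern</b> Width: 38 cm</li>',
    '<li class="item"><b>Corner wash basin</b> Width: 45 cm</li>',
]

# The "grammar" of the data: one regex per field we want to extract.
ITEM_GRAMMAR = re.compile(
    r'<b>(?P<name>[^<]+)</b>\s*Width:\s*(?P<width_cm>\d+)\s*cm'
)

def extract(page: str) -> dict:
    """Apply the grammar to one page; return the fields as a record."""
    m = ITEM_GRAMMAR.search(page)
    return {"name": m.group("name"), "width_cm": int(m.group("width_cm"))}

records = [extract(p) for p in PAGES]
# Database-like query: combined width of cistern + basin under 90 cm?
total = sum(r["width_cm"] for r in records)
print(records, "combined width:", total, "cm")
print("fits" if total < 90 else "too wide")
```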

Example of Automated Extraction

Source (HTML):
  Hebden Bridge West Yorks UK
  £350,000
  Bijou residence on the edge of this popular little town
  Residential Housing – House For Sale

Destination (XML) – a "House For Sale" record:
  location: Hebden Bridge
  agent-phone:
  listed-price: £350,000
  comments: Bijou residence on the edge of this popular little town
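A minimal sketch of producing the destination side, assuming the field values have already been extracted from the HTML source above. The element names follow the slide's field labels; the root element name is an assumption, and the agent-phone value is left empty because the slide does not give one.

```python
import xml.etree.ElementTree as ET

# Fields as extracted from the HTML source above; element names mirror
# the slide's labels. The agent-phone value is not given on the slide.
record = {
    "location": "Hebden Bridge",
    "agent-phone": "",
    "listed-price": "£350,000",
    "comments": "Bijou residence on the edge of this popular little town",
}

root = ET.Element("house-for-sale")  # root element name is an assumption
for field, value in record.items():
    ET.SubElement(root, field).text = value

print(ET.tostring(root, encoding="unicode"))
```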

Information Integration

Example: consider the problem of travel planning on the Web. There are a huge number of travel sites, with different types of information:
- Site 1: hotel and flight information
- Site 2: airports that are closest to your destination
- Site 3: directions to your hotel
- Site 4: weather in the destination city
- etc.

Information Agents are capable of retrieving and integrating info from web sites to solve complex queries. ISI built an 'information agent' which performs this function – see the University of Southern California's Information Sciences Institute (ISI): Heracles project. The technology is based on Information Extraction + Integration.
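A minimal sketch of the integration step, with every "site" mocked as a stand-in function (the site APIs, data and city/airport values are all invented for illustration; a real agent would wrap live web sources via extraction):

```python
def site1_flights(dest_airport):      # hotel/flight site (mocked)
    return [{"flight": "LS123", "to": dest_airport, "price": 89}]

def site2_nearest_airport(city):      # airport-lookup site (mocked)
    return {"Leeds": "LBA"}.get(city)

def site4_weather(city):              # weather site (mocked)
    return {"Leeds": "rain"}.get(city)

def plan_trip(city):
    """Integrate the per-site answers into one result for the query."""
    airport = site2_nearest_airport(city)
    return {
        "city": city,
        "airport": airport,
        "flights": site1_flights(airport),
        "weather": site4_weather(city),
    }

print(plan_trip("Leeds"))
```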

Information Extraction

How can we create tools to 'extract meaningful data' from the current Web for (a) populating the SW, and (b) inputting to information agents?

(1) Write a tool to extract the data… BUT we would have to write a tool for every type of data / every type of web page, e.g. a C program to process every eBay page on toilets and output the width. This is far too specific!

(2) ISI's idea: write a tool to 'learn' the format of web pages and/or particular fields. The user is given, or acquires, 'good examples' of web pages and points out the fields to be learned. The tool builds up a characterisation of the formats from the examples and uses this to recognise and extract data from similar web pages.
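One classic way to 'learn the format' is delimiter-based wrapper induction, in the style of Kushmerick's LR wrappers (not necessarily the exact algorithm ISI used): from example pages where the user has marked the target field, learn the left and right strings that surround it. A minimal sketch, with made-up example pages:

```python
def learn_lr(pages, targets):
    """Learn left/right delimiters around a user-marked field.

    pages   – example pages (strings)
    targets – the marked field value in each page
    Returns (left, right): the longest common suffix of the text before
    the field and the longest common prefix of the text after it.
    """
    befores, afters = [], []
    for page, value in zip(pages, targets):
        i = page.index(value)
        befores.append(page[:i])
        afters.append(page[i + len(value):])

    def common_suffix(strings):
        s = strings[0]
        while s and not all(t.endswith(s) for t in strings):
            s = s[1:]
        return s

    def common_prefix(strings):
        s = strings[0]
        while s and not all(t.startswith(s) for t in strings):
            s = s[:-1]
        return s

    return common_suffix(befores), common_prefix(afters)

def apply_lr(page, left, right):
    """Use the learned delimiters to extract the field from a new page."""
    start = page.index(left) + len(left)
    return page[start:page.index(right, start)]

pages = ['<tr><td>Width:</td><td>38</td></tr>',
         '<tr><td>Width:</td><td>45</td></tr>']
left, right = learn_lr(pages, ['38', '45'])
print(apply_lr('<tr><td>Width:</td><td>52</td></tr>', left, right))  # -> 52
```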

Similarity-based Learning

Algorithms that 'learn' = Machine Learning. A rough taxonomy:
- Symbolic Learning
  - Similarity-Based Learning
    - Learning from Examples (e.g. Rule Induction)
    - Learning by Observation
  - Explanation-Based Learning
- Sub-symbolic Learning
  - Neural Networks
  - Genetic Algorithms

Inductive Learning – Rule Induction from Examples

Roughly, the algorithm is as follows:
Input: a (large) number of +ve instances (examples) of concept C, plus (possibly) a number of -ve instances of C.
Output: a characterisation H of the examples, forming the rule H => C.

Inductive Learning – JARGON

Learning rule: H => C, where H is a 'hypothesis'.
- H COVERS an instance x if x satisfies H.
- H1 is a GENERALISATION of H2 if every instance that satisfies H2 also satisfies H1.
- H is CONSISTENT if it covers no -ve instance.
- H is COMPLETE if it covers all +ve instances.
- H is CHARACTERISTIC if it is complete and consistent.
- H is a MAXIMAL GENERALISATION if it is the most specific complete hypothesis.

Example: features – a, b, c, d, e
+ve instances of concept C: d&b&c, a&b&c&d, e&c&b&a
-ve instances: a&b&e, d&e
Give examples of consistent, complete and maximal hypotheses.
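Treating hypotheses and instances as sets of features, a small sketch that checks these definitions against the slide's example. The intersection of the positive instances gives the most specific complete hypothesis:

```python
# Hypotheses and instances as sets of features from {a, b, c, d, e}.
positives = [{'d', 'b', 'c'}, {'a', 'b', 'c', 'd'}, {'e', 'c', 'b', 'a'}]
negatives = [{'a', 'b', 'e'}, {'d', 'e'}]

def covers(h, x):
    """Instance x satisfies every feature in hypothesis h."""
    return h <= x

def consistent(h):
    """h covers no -ve instance."""
    return not any(covers(h, x) for x in negatives)

def complete(h):
    """h covers all +ve instances."""
    return all(covers(h, x) for x in positives)

# Most specific complete hypothesis = intersection of the positives.
maximal = set.intersection(*positives)
print(maximal)                                  # -> {'b', 'c'}
print(complete(maximal), consistent(maximal))   # -> True True, i.e. b&c is characteristic
```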

Inductive Learning – Learning features in HTML pages

In ML the input can be strings (learning a grammar) or assertions. In document feature extraction, order is important: document examples can be represented as sequences of tokens.
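A sketch of how a page fragment might be turned into such a token sequence; the token classes mirror the hierarchy on the next slide, and the classification rules here are illustrative assumptions:

```python
import re

def token_class(tok):
    """Map a raw token to an illustrative token class."""
    if re.fullmatch(r'</?\w+>', tok):
        return tok.upper()          # HTML tags kept as literal tokens
    if tok.isdigit():
        return 'NUM'
    if tok.isalpha():
        if tok.isupper():
            return 'CAPS'
        if tok.islower():
            return 'LOWER'
        return 'ALPHA'              # mixed case
    return 'PUNCT'

def tokenize(text):
    return [token_class(t) for t in re.findall(r'</?\w+>|\w+|[^\w\s]', text)]

print(tokenize('<title> Flight 123 </title> leeds'))
# -> ['<TITLE>', 'ALPHA', 'NUM', '</TITLE>', 'LOWER']
```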

Token hierarchy (from ISI's travel assistant)

The token classes form a hierarchy: e.g. CAPS and LOWER are both ALPHA; ALPHA and NUM are both ALPHANUM.

Example 1: Title-tag ALPHA EndTitleTag CAPS
Example 2: Title-tag NUM EndTitleTag LOWER
Inductive hypothesis: Title-tag ALPHANUM EndTitleTag ALPHA
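A sketch of the positionwise generalisation over the token hierarchy, reproducing the slide's inductive hypothesis. Only the fragment of the hierarchy that the example needs is encoded, and the parent links are assumptions based on the class names:

```python
# Fragment of a token-class hierarchy; the parent links are assumptions
# based on the class names used on the slide.
PARENT = {'CAPS': 'ALPHA', 'LOWER': 'ALPHA',
          'ALPHA': 'ALPHANUM', 'NUM': 'ALPHANUM', 'ALPHANUM': 'TOKEN'}

def ancestors(c):
    """The class c itself plus its ancestors, most specific first."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lgg(c1, c2):
    """Least general class covering both c1 and c2 ('TOKEN' if none)."""
    a2 = ancestors(c2)
    return next((c for c in ancestors(c1) if c in a2), 'TOKEN')

ex1 = ['Title-tag', 'ALPHA', 'EndTitleTag', 'CAPS']
ex2 = ['Title-tag', 'NUM', 'EndTitleTag', 'LOWER']
print([lgg(a, b) for a, b in zip(ex1, ex2)])
# -> ['Title-tag', 'ALPHANUM', 'EndTitleTag', 'ALPHA']
```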

Exercises

1. Find examples of consistent, complete and maximal hypotheses (if they exist) for the following example features:
   (a) +ve instances of concept C: e&b&c, f&b&c&d, e&c&b&a
       -ve instances: a&b&e, d&e&b
   (b) +ve instances of concept C: b&c, a&b&d, e&c&b&a, c&b, e
       -ve instances: a&b&e, d&e
   (c) +ve instances of concept C: d&b&c, a&b&c&d, e&c&b&a&d
       -ve instances: a&b&e&d, d&e&d
2. Run the demonstration of information extraction via the following website:
3. Look at the source of some web sites containing 'regular' data and fields and see if there are any learnable patterns that could help extract data. E.g. Amazon books: author; eBay toilet cisterns: dimensions.