Department of Informatics, University of Huddersfield. Information Extraction from the WWW using Machine Learning Techniques. Lee McCluskey, Dept of Informatics.


Motivation
General: The WWW is a virtually limitless mass of information aimed mainly at human consumption. It is desirable to make this information generally available for use by computer programs in order to provide higher levels of service to people. This supports the new area of “Semantic Technologies”, apparently the new “billion dollar” market.
- NOW: Desktop + Client-Server Technologies
- COMING: Distributed Intelligent Services
Specific: This work is related to a Knowledge Transfer Partnership just starting with a local company called View Based Systems.

Overview of Talk
- We will investigate Information Extraction: this is the process of extracting “meaningful” data from raw or semi-structured text.
- We will investigate techniques from ‘similarity-based’ Machine Learning to learn/extract meaning from traditional web page content.
- Also, Information Agents: these are programs that can retrieve information from web sites using database-like queries and can integrate information from web sites to solve complex queries.

Information Extraction from the WWW – WHY?
Problem: You’re on eBay and you want a toilet cistern and a wash basin that have a combined width of under 90cm.
Solution: waste all Sunday afternoon going through 673 entries for “toilet” looking for widths, and cross-checking with 923 entries for “wash basin”!
- Need a universally recognised query language
- Need to avoid the problems of identity (!) with universally accessible vocabularies
- Need to be able to reason with acquired knowledge
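To see why structured data matters here, the following minimal sketch shows how small the cistern-and-basin problem becomes once widths have already been extracted into records. The listing ids and widths are invented for illustration, and `pairs_under` is a hypothetical helper, not part of any real eBay interface.

```python
from itertools import product

# Hypothetical extracted records: (listing id, width in cm). Invented for illustration.
cisterns = [("c1", 38.0), ("c2", 45.0), ("c3", 52.0)]
basins = [("b1", 40.0), ("b2", 50.0), ("b3", 60.0)]


def pairs_under(cisterns, basins, max_width_cm):
    """Return (cistern_id, basin_id, combined_width) for every pair narrower than the limit."""
    return [
        (c_id, b_id, c_width + b_width)
        for (c_id, c_width), (b_id, b_width) in product(cisterns, basins)
        if c_width + b_width < max_width_cm
    ]


matches = pairs_under(cisterns, basins, 90.0)
for cistern_id, basin_id, width in matches:
    print(cistern_id, basin_id, width)
```

The hard part is not this query but obtaining the width fields reliably from pages written for people, which is the job of the wrappers discussed in the rest of the talk.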

Information Extraction from the WWW – WHY?
Our (KTP) interest: extract data from the WWW related to a “theme” or subculture, e.g. bee-keeping, role-playing games, Northern Soul music. We want to populate and maintain a central database with this information.

Information Extraction from The Web
- Information extraction is the process of extracting “meaningful” data from raw or semi-structured text.
- IE tasks form a spectrum:
  - EASIER: “Feature Extraction”: extract a particular piece of data from a semi-structured or unstructured document and give it an XML markup, e.g. extract an address from an HTML web page.
  - HARDER: “Natural Language Understanding”: take raw (English) text from a web page and turn it into some logic representing its meaning.

Information Extraction from The Web
[Diagram: WEB PAGES => WRAPPERS => STRUCTURED DATA]

tom   664  blue  BSc
bill  345  grey  PhD
dave  123  red   MSc
sue   555  red   BA

Information Extraction
- The Web’s HTML content makes it difficult to retrieve and integrate data from multiple sources.
- An agent can use a wrapper to extract the information from a collection of similar-looking Web pages.
- A wrapper ~ a grammar of the data in the web site + code to utilize the grammar.
- This is similar to turning HTML => XML + grammar (DTD).

Example of Automated Extraction
Source: HTML ======> wrapper ======> Destination: XML

Source (HTML page text):
House For Sale
Hebden Bridge West Yorkshire UK £350,000 Bijou residence on the edge of this popular little town

Destination (XML record):
Residential Housing: House For Sale
location: Hebden Bridge
agent-phone:
listed-price: £350,000
comments: Bijou residence on the edge of this popular little town...

NB: XML + schema + recognised names
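A hand-written wrapper for this one page layout might look like the sketch below. The slide's real HTML markup did not survive, so the `HTML` string, the CSS class names and the `FIELD_PATTERNS` table are all assumptions made for illustration; only the field names and values come from the slide.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical HTML for the listing. The original markup was lost, so this layout is an assumption.
HTML = (
    "<h1>House For Sale</h1>"
    "<p><b>Hebden Bridge</b> West Yorkshire UK</p>"
    "<p class='price'>£350,000</p>"
    "<p class='blurb'>Bijou residence on the edge of this popular little town</p>"
)

# Each field is located by the literal markup that surrounds it in this one layout.
FIELD_PATTERNS = {
    "location": r"<b>(.*?)</b>",
    "listed-price": r"<p class='price'>(.*?)</p>",
    "comments": r"<p class='blurb'>(.*?)</p>",
}


def wrap(html):
    """Extract the known fields from one page and return them as an XML element."""
    root = ET.Element("house-for-sale")
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, html)
        if match:  # a field that is not found is simply left out
            ET.SubElement(root, name).text = match.group(1)
    return root


listing = wrap(HTML)
xml_text = ET.tostring(listing, encoding="unicode")
print(xml_text)
```

Every pattern is tied to one site's markup, which is exactly why writing such wrappers by hand does not scale.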

Information Extraction
How can we create wrappers to ‘extract meaningful data’ from the current Web?
- Write a wrapper by hand to extract the data? BUT we would have to write a tool for every type of data and every type of web page, e.g. a C program to process every eBay page on toilets and output widths. No: this is far too specific!
- Write a tool to learn wrappers by inducing the format of web pages and/or particular fields? This is more general and maintainable.

Using ‘Rule Induction’ to learn wrappers for HTML pages
- The user is given or acquires ‘typical examples’ of the web pages containing the content to be learned.
- The user points out the fields to be learned to the agent.
- The agent builds up a characterization of the formats from the examples and transforms this into a wrapper in the form of a set of rules.
- The wrapper is used by the agent to recognize and extract data from similar web pages.

Rule Induction is an area of Machine Learning
[Taxonomy diagram; node labels: Machine Learning; Similarity-Based Learning; Explanation-Based Learning; Neural Networks; Learning from Examples; Learning by Observation; Rule Induction; Symbolic Learning; Sub-symbolic Learning; Genetic Approaches]

Rule Induction from Examples
Roughly, the algorithm is as follows:
Input: a (large) number of +ve instances (examples) of concept C, plus (possibly) a number of -ve instances of C
Output: a characterization H of the examples, forming the rule H => C
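The input/output contract above can be illustrated with a very small learner. The sketch below uses a Find-S style generalisation over fixed-length attribute tuples, which is one simple instance of learning from examples and is not claimed to be the algorithm used in the talk. The attribute layout (section, has price, county) and the example values are invented.

```python
WILDCARD = "?"


def covers(hypothesis, instance):
    """True if every non-wildcard attribute of the hypothesis equals the instance's value."""
    return all(h == WILDCARD or h == value for h, value in zip(hypothesis, instance))


def induce(positives, negatives=()):
    """Generalise the +ve instances into one conjunctive hypothesis H (Find-S style)."""
    hypothesis = list(positives[0])
    for instance in positives[1:]:
        for i, value in enumerate(instance):
            if hypothesis[i] != value:
                hypothesis[i] = WILDCARD  # attribute varies among positives, so drop the constraint
    result = tuple(hypothesis)
    # A single conjunction cannot express C if it also covers a -ve instance.
    for instance in negatives:
        if covers(result, instance):
            raise ValueError("hypothesis also covers a negative instance")
    return result


# Hypothetical instances of the concept C = "page is a house-for-sale listing".
positives = [("For Sale", "yes", "West Yorkshire"), ("For Sale", "yes", "Lancashire")]
negatives = [("To Let", "yes", "West Yorkshire")]

H = induce(positives, negatives)
print(H)  # the rule reads: H => C
```

Here the county differs between the positive examples, so it is generalised away, while the attributes shared by all positives remain as the conditions of the rule.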

Actual IE Example: the “Information Agent” of the University of Southern California’s Information Sciences Institute (ISI)
SPECIFIC PROBLEM: travel planning using the Web as an information source. There are a huge number of travel sites, with different types of information:
- hotel and flight information
- airports that are closest to your destination
- directions to your hotel
- weather in the destination city
- ...etc.
Information Agents are capable of retrieving and integrating information from web sites to solve complex queries or tasks, e.g. “book my travel for my business trip next week”.
See the Heracles project.

Heracles’ Stalker inductive algorithm
This generates wrappers, in this case rules that identify the start and end of an item within a web page. It uses:
- EXAMPLES
- A HIERARCHICAL MODEL (ONTOLOGY) OF WHAT TO EXPECT IN A WEB PAGE

Example of training examples
Stalker is given examples of the ‘items’ it has to learn the wrapper for, e.g. examples of the item (or concept) “area code” of a telephone number:
E1: 513 Pixco, Venice, Phone:
E2: 90 Colfax, Palms, Phone: ( 818 )
E3: 523 1st St., LA, Phone:
E4: 403 La Tijera, Watts, Phone: ( 310 )
Stalker learns wrappers that detect the begin/end patterns of fields so that they can be used to ‘mine’ data in unseen web pages.
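The idea of learning begin/end patterns from labelled examples can be sketched as follows. This is a deliberately simplified stand-in and not the Stalker algorithm itself: it only learns a single token that immediately precedes the field and a single token that immediately follows it, and it assumes the field is one token that is neither first nor last on the line. The training lines and the function names `tokenize`, `learn_rule` and `extract` are invented for illustration.

```python
import re


def tokenize(text):
    """Split into word/number tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def learn_rule(examples):
    """examples: list of (text, field_value). Return (start_landmark, end_landmark)."""
    starts, ends = set(), set()
    for text, value in examples:
        tokens = tokenize(text)
        i = tokens.index(value)      # position of the labelled field
        starts.add(tokens[i - 1])    # token just before the field
        ends.add(tokens[i + 1])      # token just after the field
    if len(starts) != 1 or len(ends) != 1:
        raise ValueError("no single landmark pair fits all examples")
    return starts.pop(), ends.pop()


def extract(rule, text):
    """Apply a learned rule: skip to the start landmark, then read up to the end landmark."""
    start, end = rule
    tokens = tokenize(text)
    i = tokens.index(start) + 1
    j = tokens.index(end, i)
    return " ".join(tokens[i:j])


# Hypothetical training lines with the area code pointed out by the user.
training = [
    ("12 High St, Leeds, Phone: ( 111 ) 555 0100", "111"),
    ("7 Mill Lane, Halifax, Phone: ( 222 ) 555 0101", "222"),
]
rule = learn_rule(training)
unseen = "3 Station Rd, Hebden Bridge, Phone: ( 333 ) 555 0102"
print(rule, extract(rule, unseen))
```

When the examples use more than one phone format, no single landmark pair fits and this sketch raises an error; handling such cases is where a real wrapper-induction system needs richer rules.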

Problems with Wrapper Induction
ISI report some success with their travel Information Agent and its IE process, BUT:
- Wrapper brittleness: a website’s format may change, and maintenance is costly
- Background knowledge (token hierarchy) is not strong
- Unsupervised wrapper induction would be better
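One way to reduce the cost of brittleness is to check extracted values against the shape each field is expected to have, so that a layout change is noticed early. The sketch below is an assumption-laden illustration: the `EXPECTED` patterns, the 0.8 threshold and the function name `wrapper_looks_broken` are all hypothetical and not taken from the ISI work.

```python
import re

# Hypothetical expectations for each extracted field; the patterns are assumptions.
EXPECTED = {
    "listed-price": re.compile(r"£\d{1,3}(,\d{3})*"),
    "location": re.compile(r"[A-Za-z][A-Za-z .'-]*"),
}


def wrapper_looks_broken(records, expected=EXPECTED, min_ok=0.8):
    """Return True when too few extracted records match the expected field shapes."""
    if not records:
        return True  # extracting nothing at all is itself a warning sign
    ok = sum(
        all(pattern.fullmatch(record.get(field, "")) for field, pattern in expected.items())
        for record in records
    )
    return ok / len(records) < min_ok


good = [{"listed-price": "£350,000", "location": "Hebden Bridge"}]
after_redesign = [{"listed-price": "<td>", "location": "Hebden Bridge"}]
print(wrapper_looks_broken(good), wrapper_looks_broken(after_redesign))
```

A check like this only detects that a wrapper has probably stopped working; repairing it still requires re-learning or re-writing the extraction rules.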

Summary
- Information Extraction is the process of extracting “meaningful” data from raw or semi-structured text
- Wrappers are programs (rules) which are attached to web pages to extract data
- Machine Learning techniques can be used to create wrappers
- There are still many problems with these methods, especially in the learning and maintaining of wrappers

Extra Reading
- “Learning to Extract Symbolic Knowledge from the World Wide Web”. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. AAAI-98.
- “Hierarchical Wrapper Induction for Semi-structured Information Sources”. Ion Muslea, Steven Minton, Craig A. Knoblock. Kluwer.
- See the Kushmerick references: apparently he invented wrapper induction.

Related Legal/Ethical/Professional/Methodological Issues
- Is it legal and/or ethical to automatically ‘harvest’ data from the WWW and re-use or sell it? In what cases is it illegal?
- How does one automate checking the veracity of WWW data?
- Will website owners conceal their data if the practice becomes widespread?
- Future: do we really want distributed web intelligence?