Department of Informatics, University of Huddersfield
Information Extraction from the WWW using Machine Learning Techniques
Lee McCluskey, Dept of Informatics

Motivation
General: The WWW is a virtually limitless mass of information aimed mainly at human consumption. It is desirable to make this information generally available for use by computer programs in order to provide higher levels of service to people. This supports the new area of “Semantic Technologies”, apparently the new “billion dollar” market.
NOW: Desk Top + Client-Server Technologies
COMING: Distributed Intelligent Services
Specific: This work is related to a Knowledge Transfer Partnership just starting with a local company called View Based Systems.

Overview of Talk
We will investigate:
- Information Extraction: the process of extracting “meaningful” data from raw or semi-structured text. We will investigate techniques from ‘similarity-based’ Machine Learning to learn/extract meaning from traditional web page content.
- Information Agents: programs that can retrieve information from web sites using database-like queries and can integrate information from web sites to solve complex queries.

Information Extraction from the WWW – WHY?
Problem: You’re on eBay and you want a toilet cistern and wash basin that have a combined width of under 90cm.
Solution: waste all Sunday afternoon going through 673 entries for “toilet” looking for widths and cross-checking with 923 entries for “wash basin”!
- Need a universally-recognised query language
- Need to avoid the problems of identity (!) with universally-accessible vocabularies
- Need to be able to reason with acquired knowledge
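
To illustrate why extracted, structured data helps with the cistern-and-basin problem: once widths are available as fields, the question becomes a simple join over two lists. The following is a minimal sketch only; the records, field names and widths are invented for illustration and do not come from any real listing.

```python
from itertools import product

# Hypothetical extracted listings; item names and widths (cm) are invented.
toilets = [
    {"item": "cistern A", "width_cm": 38.0},
    {"item": "cistern B", "width_cm": 45.0},
]
basins = [
    {"item": "basin X", "width_cm": 50.0},
    {"item": "basin Y", "width_cm": 42.0},
]


def pairs_under(toilets, basins, max_total_cm):
    """Return (toilet, basin, total) triples whose combined width is under the limit."""
    return [
        (t["item"], b["item"], t["width_cm"] + b["width_cm"])
        for t, b in product(toilets, basins)
        if t["width_cm"] + b["width_cm"] < max_total_cm
    ]


results = pairs_under(toilets, basins, 90.0)
for toilet, basin, total in results:
    print(f"{toilet} + {basin}: {total} cm")
```

The hard part, and the subject of the rest of the talk, is obtaining the `width_cm` values from pages written for human readers.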

Information Extraction from the WWW – WHY?
Our (KTP) interest: extract data from the WWW related to a “theme” or subculture, e.g. bee-keeping, role-playing games, Northern Soul music. We want to populate and maintain a central database with this information.

Information Extraction from The Web
- Information extraction is the process of extracting “meaningful” data from raw or semi-structured text.
- IE tasks form a spectrum from EASIER to HARDER:
  - EASIER, “Feature Extraction”: extract a particular piece of data from a semi-structured or unstructured document and give it an XML markup, e.g. extract an address from an HTML web page.
  - HARDER, “Natural Language Understanding”: take raw (English) text from a web page and turn it into some logic representing its meaning.

Information Extraction from The Web
[Diagram: WEB PAGES -> WRAPPERS -> STRUCTURED DATA]
Example structured data:
tom   664  blue  BSc
bill  345  grey  PhD
dave  123  red   MSc
sue   555  red   BA

Information Extraction
- The Web’s HTML content makes it difficult to retrieve and integrate data from multiple sources.
- An agent can use a wrapper to extract the information from a collection of similar-looking Web pages.
- The wrapper ~ grammar of the data in the web site + code to utilize the grammar
- This is similar to turning the HTML => XML + grammar (DTD)

Example of Automated Extraction
Source: HTML ======> (wrapper) ======> Destination: XML
Source (HTML page text): Hebden Bridge West Yorkshire UK £350,000 Bijou residence on the edge of this popular little town Residential Housing House For Sale
Destination (XML fields):
  House For Sale
  location: Hebden Bridge
  agent-phone:
  listed-price: £350,000
  comments: Bijou residence on the edge of this popular little town...
NB: XML + schema + recognised names
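
A hand-written wrapper for a page like this could look roughly as follows. This is a sketch under assumptions: the HTML tags and class names in `PAGE` are invented (the slide does not show the original markup), the root element name `house-for-sale` is hypothetical, and the field names are taken from the destination fields listed on the slide. The "grammar" is a table of one regular expression per field; the "code to utilize the grammar" is the `wrap` function.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical HTML fragment; tags and class names are invented for illustration.
PAGE = (
    '<div class="listing"><h2>House For Sale</h2>'
    '<span class="loc">Hebden Bridge</span>'
    '<span class="price">£350,000</span>'
    '<p class="desc">Bijou residence on the edge of this popular little town</p>'
    "</div>"
)

# The "grammar": one pattern per field, keyed by the XML element name to emit.
FIELD_PATTERNS = {
    "location": r'<span class="loc">(.*?)</span>',
    "listed-price": r'<span class="price">(.*?)</span>',
    "comments": r'<p class="desc">(.*?)</p>',
}


def wrap(page):
    """Apply each field pattern to the page and build an XML element from the matches."""
    root = ET.Element("house-for-sale")  # hypothetical root element name
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, page, re.DOTALL)
        if match:  # a field that is not found is simply omitted
            ET.SubElement(root, name).text = match.group(1).strip()
    return root


record = wrap(PAGE)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Note how tightly the patterns are tied to one page layout: any change to the class names breaks the wrapper.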

Information Extraction
How can we create wrappers to ‘extract meaningful data’ from the current Web?
?? Write a wrapper by hand to extract data ... BUT we would have to write a tool for every type of data / every type of web page, e.g. a C program to process every eBay page on toilets and output widths. No: this is far too specific!
?? Write a tool to learn wrappers by inducing the format of web pages and/or particular fields. This is more general and maintainable.

Using ‘Rule Induction’ to learn wrappers for HTML pages
- The user is given or acquires ‘typical examples’ of the web pages containing the content to be learned.
- The user points out to the agent the fields to be learned.
- The agent builds up a characterization of the formats from the examples and transforms this into a wrapper in the form of a set of rules.
- The wrapper is used by the agent to recognize and extract data from similar web pages.

Rule Induction is an area of Machine Learning
[Taxonomy diagram of Machine Learning. Node labels: Symbolic Learning, Sub-symbolic Learning, Similarity-Based Learning, Explanation-Based Learning, Neural Networks, Genetic Approaches, Learning from Examples, Learning by Observation, Rule Induction]

Rule Induction from Examples
Roughly, the algorithm is as follows:
Input: a (large) number of +ve instances (examples) of concept C, plus (possibly) a number of -ve instances of C
Output: a characterization H of the examples, forming the rule H => C
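
One very simple instance of this input/output scheme is a Find-S-style learner: start from the first +ve example and drop every attribute test that a later +ve example violates, then check that the result H rejects all -ve examples. The sketch below is only one of many possible rule induction algorithms, not the method used by any particular system; the attributes (`prefix`, `has_digits`, `tag`) and the "price" concept are invented for illustration.

```python
def covers(hypothesis, example):
    """True if the example satisfies every attribute=value test in the hypothesis."""
    return all(example.get(attr) == value for attr, value in hypothesis.items())


def induce_rule(positives, negatives=()):
    """Return the most specific conjunction H of attribute tests covering all positives.

    Each example is a dict mapping attribute -> value. An attribute stays in H
    only if every positive example agrees on its value.
    """
    if not positives:
        raise ValueError("need at least one positive example")
    hypothesis = dict(positives[0])
    for example in positives[1:]:
        # Generalize: drop any test that this positive example violates.
        hypothesis = {a: v for a, v in hypothesis.items() if example.get(a) == v}
    # H is only acceptable if it rejects every negative example.
    for example in negatives:
        if covers(hypothesis, example):
            raise ValueError("no conjunctive rule separates the examples")
    return hypothesis


# Hypothetical examples of the concept C = "this token is a price".
positives = [
    {"prefix": "£", "has_digits": True, "tag": "span"},
    {"prefix": "£", "has_digits": True, "tag": "b"},
]
negatives = [
    {"prefix": "(", "has_digits": True, "tag": "span"},
]

H = induce_rule(positives, negatives)
print(H, "=> price")
```

Here the two +ve examples disagree on `tag`, so that test is dropped and H keeps only the tests they share.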

Actual IE Example: the “Information Agent” from the University of Southern California’s Information Sciences Institute (ISI)
SPECIFIC PROBLEM: travel planning using the Web as an information source. There are a huge number of travel sites, with different types of information:
- hotel and flight information
- airports that are closest to your destination
- directions to your hotel
- weather in the destination city
... ETC
Information Agents are capable of retrieving and integrating information from web sites to solve complex queries or tasks, e.g. “book my travel for my business trip next week”.
See the Heracles project (

Heracles’ Stalker inductive algorithm
This generates wrappers, in this case rules that identify the start and end of an item within a web page. It uses:
- EXAMPLES
- A HIERARCHICAL MODEL (ONTOLOGY) OF WHAT TO EXPECT IN A WEB PAGE

Example of training examples
Stalker is given examples of the ‘items’ it has to learn the wrapper for, e.g. examples of the item (or concept) “area code” of a telephone number:
E1: 513 Pixco, Venice, Phone:
E2: 90 Colfax, Palms, Phone: ( 818 )
E3: 523 1st St., LA, Phone:
E4: 403 La Tijera, Watts, Phone: ( 310 )
Stalker learns wrappers that detect the begin/end patterns of fields so that they can be used to ‘mine’ data in unseen web pages.
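
The idea of learning "begin" patterns from labelled examples can be sketched as a greedy covering loop over single-token landmarks. This is a much-simplified illustration, not the Stalker algorithm itself: it learns start rules only, uses one literal token per rule, and ignores the token hierarchy. The phone numbers in the training strings are invented wherever the slide does not show them, and all function names are hypothetical.

```python
def skip_to(tokens, landmark):
    """Return the index just after the first occurrence of landmark, or None if absent."""
    try:
        return tokens.index(landmark) + 1
    except ValueError:
        return None


def is_perfect(landmark, examples):
    """A landmark is 'perfect' if, wherever it matches, it lands on the labelled start."""
    return all(skip_to(toks, landmark) in (None, start) for toks, start in examples)


def learn_start_rules(examples):
    """Greedy covering: return an ordered list of single-token landmarks."""
    rules, uncovered = [], list(examples)
    while uncovered:
        # Candidates: the token immediately before the item in each uncovered example.
        candidates = {toks[start - 1] for toks, start in uncovered if start > 0}
        best, best_cover = None, []
        for landmark in sorted(candidates):
            if not is_perfect(landmark, uncovered):
                continue
            cover = [ex for ex in uncovered if skip_to(ex[0], landmark) == ex[1]]
            if len(cover) > len(best_cover):
                best, best_cover = landmark, cover
        if best is None:
            raise ValueError("no single-token landmark covers the remaining examples")
        rules.append(best)
        uncovered = [ex for ex in uncovered if ex not in best_cover]
    return rules


def extract_start(tokens, rules):
    """Apply the rules in order; the first landmark that matches decides the start."""
    for landmark in rules:
        position = skip_to(tokens, landmark)
        if position is not None:
            return position
    return None


def example(text, item):
    """Tokenize on whitespace and label the position of the item to be learned."""
    tokens = text.split()
    return tokens, tokens.index(item)


# Training data: digits after "Phone :" are invented for illustration.
examples = [
    example("513 Pixco , Venice , Phone : 1 - 800 - 555 - 1515", "800"),
    example("90 Colfax , Palms , Phone : ( 818 ) 508 - 1570", "818"),
    example("523 1st St. , LA , Phone : 1 - 888 - 578 - 2293", "888"),
    example("403 La Tijera , Watts , Phone : ( 310 ) 798 - 0008", "310"),
]

rules = learn_start_rules(examples)

# Apply the learned rules to an unseen (also invented) line.
unseen = "12 Main St. , Venice , Phone : ( 213 ) 555 - 0100".split()
start = extract_start(unseen, rules)
print(rules, unseen[start])
```

Because the two phone formats need different landmarks, the result is an ordered disjunction of rules rather than a single rule.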

Problems with Wrapper Induction
ISI report some success with their travel Information Agent and its IE process, BUT:
- Wrapper brittleness: website formats may change, so maintenance is costly
- Background knowledge (token hierarchy) is not strong
- Unsupervised wrapper induction would be better

Summary
- Information Extraction is the process of extracting “meaningful” data from raw or semi-structured text
- Wrappers are programs (rules) which are attached to web pages to extract data
- Machine Learning techniques can be used to create wrappers
- There are still many problems with these methods, especially in the learning and maintaining of wrappers

Extra Reading
- “Learning to Extract Symbolic Knowledge from the World Wide Web”. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. AAAI-98. January
- “Hierarchical Wrapper Induction for Semi-structured Information Sources”. Ion Muslea, Steven Minton, Craig A. Knoblock. Kluwer,
- See Kushmerick references: apparently he invented wrapper induction

Related Legal / Ethical / Professional / Methodological Issues
- Is it legal and/or ethical to automatically ‘harvest’ data from the WWW and re-use or sell it? In what cases is it illegal?
- How does one automate checking the veracity of WWW data?
- Will website owners conceal their data if the practice becomes widespread?
- Future: do we really want distributed web intelligence?