
Presentation on theme: "Web-Harvest: Open Source Web Data Extraction Tool" — Presentation transcript:

1 Web-Harvest: Open Source Web Data Extraction Tool
이재정, Software Engineering Laboratory, School of Computer Science, University of Seoul, 2007

2 1. Overview
- Open source web data extraction tool written in Java
  - http://web-harvest.sourceforge.net
  - Web-Harvest 1.0 released October 15th, 2007
- Offers a way to collect desired web pages and extract useful data from them
- Focuses on HTML/XML-based web sites
- The goal is not to propose a new extraction method, but to provide a way to easily use and combine existing extraction technologies

3 2. Basic concept
- The World Wide Web (as the largest database) often contains various data
- The problem is that this data is mixed together with formatting code
  - a mix that makes the content human-friendly, but not machine-friendly
- Characteristics
  - Web-Harvest can easily be supplemented by custom Java libraries
  - Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files
    - A configuration file describes a sequence of processors, each executing some common task
  - Processors execute in the form of a pipeline
    - the output of one processor execution is the input to the next one

4 Configuration language
- A simple configuration fragment (see the sketch below)
- When Web-Harvest executes this part of the configuration, the following steps occur:
  1. The http processor downloads content from the specified URL.
  2. The html-to-xml processor cleans up that HTML, producing XHTML content.
  3. The xpath processor searches for specific links in the XHTML from the previous step, giving a URL sequence as the result.
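The fragment itself did not survive in this transcript. The following is a minimal sketch of what such a fragment looks like in Web-Harvest's XML configuration language, reconstructed from the three steps above; the URL, XPath expression, and variable name are placeholders:

    <config charset="UTF-8">
        <!-- Processors are nested: the innermost runs first and its
             output feeds the processor that encloses it. -->
        <var-def name="links">
            <xpath expression="//a/@href">            <!-- 3. extract link URLs -->
                <html-to-xml>                         <!-- 2. clean HTML into XHTML -->
                    <http url="http://example.com/"/> <!-- 1. download the page -->
                </html-to-xml>
            </xpath>
        </var-def>
    </config>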

5 Data values and variables
- All data produced and consumed during the extraction process in Web-Harvest has three representations: text, binary, and list
- In the previous configuration, the html-to-xml processor uses the downloaded content as text in order to transform it to XHTML
- Web-Harvest provides a variable context for storing and using variables
  - When Web-Harvest is used programmatically, the variable context may be initialized by the user in order to add custom values and functionality (see the sketch below)
  - After execution, the variable context is available for reading variables back out of it
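A minimal sketch of programmatic use, based on the Web-Harvest 1.x Java API; the configuration path, working directory, variable names, and the map-style context access are assumptions, so check the project's javadoc for the exact signatures:

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;
    import org.webharvest.runtime.variables.NodeVariable;

    public class HarvestExample {
        public static void main(String[] args) {
            // Load an XML configuration file (path is a placeholder).
            ScraperConfiguration config = new ScraperConfiguration("config.xml");
            // The second argument is the scraper's working directory.
            Scraper scraper = new Scraper(config, "work");

            // Seed the variable context before execution
            // (assumed: the context behaves like a map of named variables).
            scraper.getContext().put("startUrl", new NodeVariable("http://example.com/"));

            scraper.execute();

            // After execution, variables can be read back out of the context.
            System.out.println(scraper.getContext().get("links"));
        }
    }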

6 Backgrounds
- How do you create a "database of everything" on the web and make it searchable?
- This is the topic of an article by Alon Halevy and other Googlers: "Structured Data Meets the Web: A Few Observations"
- The World Wide Web is witnessing an increase in the amount of structured content

7 Backgrounds
- The deep web: content that lies hidden behind queryable HTML forms
  - The majority of forms offer search over data stored in back-end databases
- Google Base: a second source of structured data on the web, an attempt to enable content owners to upload structured data into Google
  - so that it can be searched
- Annotation schemes: a third class of structured data on the web, the result of a variety of annotation schemes
  - Annotation schemes enable users to add tags describing underlying content (e.g., photos) to enable better search over that content

8 Backgrounds
- Integrating structured and unstructured data
  - We consider how structured data is integrated into today's web-search paradigm, which is dominated by keyword search
- The approach and challenges:
  - Deep web: the typical solution is based on creating a virtual schema for a particular domain and mappings from the fields of the forms in that domain to the attributes of the virtual schema
  - At query time, a user fills out a form in the domain of interest and the query is reformulated as queries over all the forms in that domain (see the sketch below)
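As a toy illustration only (all names are invented, not from the article), the mapping-and-reformulation step can be pictured like this:

    import java.util.HashMap;
    import java.util.Map;

    public class DeepWebReformulation {
        public static void main(String[] args) {
            // Virtual schema for a hypothetical used-car domain: {make, model, maxPrice}.
            // Mapping from one site's form field names to virtual-schema attributes.
            Map<String, String> siteFieldToAttribute = new HashMap<>();
            siteFieldToAttribute.put("manufacturer", "make");
            siteFieldToAttribute.put("car_model", "model");
            siteFieldToAttribute.put("price_limit", "maxPrice");

            // A user query expressed against the virtual schema.
            Map<String, String> query = Map.of("make", "Ford", "maxPrice", "10000");

            // Reformulate: fill this site's form fields from the virtual-schema query.
            Map<String, String> formSubmission = new HashMap<>();
            for (Map.Entry<String, String> e : siteFieldToAttribute.entrySet()) {
                String value = query.get(e.getValue());
                if (value != null) {
                    formSubmission.put(e.getKey(), value);
                }
            }
            // Prints the site-specific submission,
            // e.g. {manufacturer=Ford, price_limit=10000} (map order may vary).
            System.out.println(formSubmission);
        }
    }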

9 Backgrounds
- Integrating structured and unstructured data
- The approach and challenges:
  - Google Base: Google Base faces a different integration challenge
  - Experience has shown that we cannot expect users to come directly to base.google.com to pose queries targeted solely at Google Base
  - The vast majority of people are unaware of Google Base and do not understand the distinction between it and the Web index

10 Backgrounds
- Integrating structured and unstructured data
  - Annotation schemes: typically, the annotations can be used to improve recall and ranking for resources
  - In the case of Google Co-op, customized search engines can specify query patterns that trigger specific facets, as well as provide hints for re-ranking search results
  - The annotations that any customized search engine specifies are visible only within the context of that search engine

11 Backgrounds
- A database of everything
  - Instead of necessarily creating mappings between data sources and a virtual schema, we will rely much more heavily on schema clustering
  - Clustering lets us measure how close two schemas are to each other, without actually having to map each of them to a virtual schema in a particular domain
  - Schemas may belong to many clusters, thereby gracefully handling complex relationships between domains
  - Keyword queries will be mapped to clusters of schemas, and at that point we will try to apply approximate schema mappings in order to leverage data from multiple sources to answer queries (a toy similarity measure is sketched below)
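The article does not prescribe a particular closeness measure; purely as an illustration, overlap of attribute names (Jaccard similarity) is one simple stand-in for "how close two schemas are":

    import java.util.HashSet;
    import java.util.Set;

    public class SchemaSimilarity {
        // Jaccard similarity over attribute-name sets: |A ∩ B| / |A ∪ B|.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        public static void main(String[] args) {
            Set<String> schemaA = Set.of("make", "model", "price", "year");
            Set<String> schemaB = Set.of("manufacturer", "model", "price", "mileage");
            // Two shared names out of six distinct names -> 2/6, about 0.33.
            System.out.println(jaccard(schemaA, schemaB));
        }
    }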

12 Demo
- Example: http://web-harvest.sourceforge.net

