Published by Felix Strickland. Modified over 9 years ago.
1
Effective Web Data Extraction with Standard XML Technologies
Source: International World Wide Web Conference, Proceedings of the Tenth International Conference on World Wide Web, Hong Kong
Pages: 689-696
Year of Publication: 2001
Author: Jussi Myllymaki
2
Outline
Introduction
Problem
ANDES architecture
Conclusion and future work
3
Introduction
This paper mainly discusses the problem of extracting data from Web sites and proposes an XML-based approach to solving the data extraction problem.
In this paper we focus on systems-oriented issues in Web data extraction and describe our approach for building a dependable extraction process.
4
Extracting structured data requires solving five problems
navigation problem: finding target HTML pages on a site by following hyperlinks.
data extraction problem: extracting relevant pieces of data from these pages.
structure synthesis problem: distilling the data and improving its structuredness.
data mapping problem: ensuring data homogeneity.
data integration problem: merging data from separate HTML pages.
5
Extracting structured data from Web sites
Web site navigation
Data extraction
Hyperlink synthesis
Structure synthesis
Data mapping
Data integration
6
Web site navigation
The ANDES data extraction framework views Web sites as consisting of two types of HTML pages: target HTML pages and navigational HTML pages. ANDES uses Grand Central Station (GCS) as its crawler. GCS is a flexible and extensible crawler framework developed at the IBM Almaden Research Center.
7
Data extraction
The first step in data extraction is to translate the content into well-formed XML syntax. The specific approach taken in the ANDES framework is to pass the original HTML page through a filter to produce XHTML. The XHTML is then processed by a pipeline of XSLT files: the first XSLT file merely extracts data from the XHTML page, while subsequent XSLT files in the pipeline can refine the data and fill in missing data from domain knowledge.
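The staged pipeline described on this slide can be sketched roughly as follows. ANDES itself uses XSLT files. This sketch swaps in plain Python functions over `xml.etree.ElementTree` purely for illustration. The sample XHTML, the element names (`products`, `product`, `name`, `price`, `currency`), and the default currency are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML as it might look after the HTML-to-XHTML filter.
XHTML = """<html><body><table>
<tr><td>Widget</td><td>9.99</td></tr>
<tr><td>Gadget</td><td></td></tr>
</table></body></html>"""


def extract(page):
    """Stage 1: pull raw cell values out of the XHTML table."""
    out = ET.Element("products")
    for row in page.iter("tr"):
        cells = [(td.text or "").strip() for td in row.findall("td")]
        product = ET.SubElement(out, "product")
        ET.SubElement(product, "name").text = cells[0]
        ET.SubElement(product, "price").text = cells[1]
    return out


def refine(products):
    """Stage 2: fill in data the page omits (assumed domain knowledge)."""
    for product in products.iter("product"):
        price = product.find("price")
        if not price.text:
            price.text = "unknown"
        ET.SubElement(product, "currency").text = "USD"  # assumed default
    return products


def run_pipeline(xhtml, stages):
    """Feed the output of each stage into the next one."""
    doc = ET.fromstring(xhtml)
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(XHTML, [extract, refine])
print(ET.tostring(result, encoding="unicode"))
```

Keeping extraction and refinement in separate stages mirrors the slide's point that only the first stage needs to know the page layout, while later stages work on already-extracted data.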
8
Hyperlink synthesis
One shortcoming of today's crawlers is that they only follow static hyperlinks, not the dynamic hyperlinks that result from HTML forms and JavaScript code.
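To make the idea of a dynamic hyperlink concrete, here is a minimal sketch of how the URLs produced by submitting an HTML form could be generated ahead of time so that a crawler can follow them like static links. It is not taken from ANDES. The function name `synthesize_links`, the URL `http://example.com/search`, and the field names are hypothetical, and the sketch assumes a form submitted with GET.

```python
from itertools import product
from urllib.parse import urlencode


def synthesize_links(action_url, field_values):
    """Build one GET URL per combination of form field values."""
    names = sorted(field_values)
    links = []
    for combo in product(*(field_values[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        links.append(f"{action_url}?{query}")
    return links


links = synthesize_links(
    "http://example.com/search",  # hypothetical form action
    {"category": ["books", "music"], "page": ["1", "2"]},
)
for link in links:
    print(link)
```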
9
Structure synthesis
What makes structure synthesis difficult is that a Web site may not provide enough structure to allow a direct mapping to an XML structure. In ANDES, missing data can be filled in by XSLT code that encapsulates domain knowledge.
10
Data mapping
Mapping discrete values into a standard format improves the quality of the extracted data. Homogenization of discrete values and measured values is performed in ANDES with a combination of conditional statements, regular expressions, and domain-specific knowledge encapsulated in the XSLT code.
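The combination of conditionals, regular expressions, and domain knowledge can be illustrated with a small sketch. ANDES does this in XSLT. The Python below is only a stand-in, and the synonym table, the unit list, and both function names are hypothetical.

```python
import re

# Hypothetical mapping table standing in for domain-specific knowledge.
COLOR_SYNONYMS = {"grey": "gray", "charcoal": "gray", "navy": "blue"}

WEIGHT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kg|g|lbs?)\s*$", re.IGNORECASE)
GRAMS_PER_UNIT = {"kg": 1000.0, "g": 1.0, "lb": 453.592, "lbs": 453.592}


def map_color(raw):
    """Map a discrete value onto a standard vocabulary."""
    value = raw.strip().lower()
    return COLOR_SYNONYMS.get(value, value)


def map_weight_to_grams(raw):
    """Convert a measured value to grams, or None if it is not recognized."""
    match = WEIGHT_PATTERN.match(raw)
    if match is None:
        return None
    amount, unit = match.groups()
    return round(float(amount) * GRAMS_PER_UNIT[unit.lower()], 1)


print(map_color(" Grey "), map_weight_to_grams("1.5kg"), map_weight_to_grams("heavy"))
```

Returning `None` for unrecognized input, rather than guessing, leaves it to a later stage to decide how to treat values that could not be homogenized.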
11
Data integration
Why is this necessary?
Some Web sites use HTML frames for layout, which breaks up a logical data unit into separate HTML documents.
Some Web sites break up the data across multiple "sibling pages" so as not to overload a single page with too much information.
Solving steps:
The original HTML documents are crawled normally and data is extracted from them individually.
The partial outputs are then concatenated into one output, and the resulting file is passed through an XSLT filter that merges related data.
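The two solving steps (concatenate, then merge) can be sketched as below. ANDES performs the merge with an XSLT filter. This Python version is an illustrative substitute, and the sample documents, the `id` attribute used as the merge key, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical partial outputs extracted from two sibling pages.
PAGE_A = '<items><item id="1"><name>Widget</name></item></items>'
PAGE_B = '<items><item id="1"><price>9.99</price></item></items>'


def concatenate(partials):
    """Step 1: place every partial record under a single root."""
    combined = ET.Element("items")
    for xml_text in partials:
        combined.extend(list(ET.fromstring(xml_text)))
    return combined


def merge_related(combined, key="id"):
    """Step 2: fold records that share a key into one record."""
    merged = ET.Element("items")
    by_key = {}
    for item in combined:
        target = by_key.get(item.get(key))
        if target is None:
            target = ET.SubElement(merged, "item", {key: item.get(key)})
            by_key[item.get(key)] = target
        target.extend(list(item))
    return merged


merged = merge_related(concatenate([PAGE_A, PAGE_B]))
print(ET.tostring(merged, encoding="unicode"))
```

The sketch assumes every partial record carries the key attribute. Records without a shared identifier would need some other way of deciding which pieces belong together.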
12
The overview of ANDES (A Nifty Data Extraction System) architecture
13
The ANDES architecture consists of five components:
data retriever: The default data retriever in ANDES is the Grand Central Station (GCS) crawler. GCS retrieves the target HTML pages from the Web site, and these pages are passed to the extractor.
extractor: Performs the data extraction, structure synthesis, data mapping, and data integration functions.
checker: The output XML documents produced by the extractor are handed to the checker, which checks whether the submitted data is correct.
exporter: The default data exporter in ANDES converts the XML data into relational tuples and inserts these tuples into a JDBC database.
scheduler/manager interface: The scheduler is responsible for triggering data extraction at predefined times and for repeating the extraction periodically. The Web-based management interface is used by a system administrator to monitor and control ANDES.
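The hand-off from checker to exporter can be pictured with a short sketch. It uses Python's `sqlite3` with an in-memory database as a stand-in for the JDBC database. The validation rule (non-empty name, numeric price), the table layout, and the sample records are assumptions made for illustration and are not taken from ANDES.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical extractor output; the second record is deliberately invalid.
EXTRACTED = """<items>
<item><name>Widget</name><price>9.99</price></item>
<item><name>Gadget</name><price>n/a</price></item>
</items>"""


def check(item):
    """Checker: accept a record only if it has a name and a numeric price."""
    name = item.findtext("name", default="").strip()
    try:
        float(item.findtext("price", default=""))
    except ValueError:
        return False
    return bool(name)


def export(items, connection):
    """Exporter: turn XML records into tuples and insert them into a table."""
    connection.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")
    rows = [(i.findtext("name"), float(i.findtext("price"))) for i in items]
    connection.executemany("INSERT INTO items VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)


connection = sqlite3.connect(":memory:")
valid = [item for item in ET.fromstring(EXTRACTED).iter("item") if check(item)]
exported = export(valid, connection)
print(exported, connection.execute("SELECT name, price FROM items").fetchall())
```

Running the checker before the exporter keeps malformed records out of the database, which is the reason the architecture places the checker between the extractor and the exporter.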
14
Conclusion and future work
The paper expects to use the XML Schema syntax for expressing data validation rules.