Towards Automatic Structured Web Data Extraction System
Tomas Grigalis, 2nd year PhD student
Scientific supervisor: prof. habil. dr. Antanas Čenys

Outline: Introduction; The ClustVX approach; Experiments; Conclusions

Structured Web Data

Title                            Model      Price
Fuji FinePix Z110EXR 14MP        562/6283   £119.99
Fujifilm XP30 14MP Waterproof    559/5101   £129.99
Samsung ST200F Smart             559/7635   £111.99

Diagram labels: Database table with structured data; Data record; Web server; Browser; Rendered view in a web browser

The Goal: an unsupervised, domain-independent structured web data extraction system that takes web pages with structured data as input and returns the structured data.

Key Problems
1) Web pages that look visually similar usually have completely different underlying HTML source code.
2) There are millions of web pages, each with a different design and different HTML source code.
3) Web 2.0 introduced asynchronous JavaScript HTTP requests (AJAX), which modify the HTML source code on the fly.

The ClustVX approach
ClustVX is based on two fundamental observations:
1) A vast amount of information on the Web is presented using fixed templates filled with data from underlying databases.
2) Although the templates and underlying data differ from site to site, humans understand them easily by analyzing the repeating visual patterns on a given Web page.

HTML TREE

Repeating patterns in HTML TREE (1st observation)
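One way to picture the first observation is to count how often the same path of tag names leads to a text node. The sketch below is only an illustration built on Python's standard `html.parser`; the class name `TagPathCounter` and the sample markup are hypothetical and are not taken from ClustVX.

```python
from collections import Counter
from html.parser import HTMLParser


class TagPathCounter(HTMLParser):
    """Counts text nodes per tag path (hypothetical helper, not ClustVX code)."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, root first
        self.paths = Counter()   # tag path -> number of text nodes found there

    def handle_starttag(self, tag, attrs):
        # Simplification: void tags such as <br> would stay on the stack.
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.paths["/" + "/".join(self.stack)] += 1


# Assumed sample page: a template list filled with three records.
html = """<html><body><h1>Cameras</h1><ul>
<li><b>Fuji FinePix Z110EXR 14MP</b><i>£119.99</i></li>
<li><b>Fujifilm XP30 14MP Waterproof</b><i>£129.99</i></li>
<li><b>Samsung ST200F Smart</b><i>£111.99</i></li>
</ul></body></html>"""

counter = TagPathCounter()
counter.feed(html)
counter.close()

# Paths that occur more than once are candidates for template-generated data.
repeating = {path: n for path, n in counter.paths.items() if n > 1}
print(repeating)
```

The heading appears under a path that occurs once, while the title and price paths each occur three times, which is the kind of repetition a fixed template produces.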

Data with the same semantic meaning, e.g. PRICE, is visualized using the same style (2nd observation)

ClustVX: First, cluster visually similar web page elements
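The clustering step can be sketched as grouping elements by a key that combines structure and appearance. This is a minimal sketch under stated assumptions: each element is assumed to be already described by a tag path and a few rendered style attributes, and the `Element` fields and `cluster_by_appearance` function are hypothetical names, not the ClustVX implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Element(NamedTuple):
    tag_path: str  # assumed: tag names from the root, without positions
    font: str      # assumed rendered style attributes
    size: int
    color: str
    text: str


def cluster_by_appearance(elements):
    """Group element texts whose tag path and style are identical."""
    clusters = defaultdict(list)
    for el in elements:
        key = (el.tag_path, el.font, el.size, el.color)
        clusters[key].append(el.text)
    return dict(clusters)


# Assumed sample input: three titles, three prices, one page heading.
elements = [
    Element("/html/body/h1", "Arial", 24, "black", "Cameras"),
    Element("/html/body/ul/li/b", "Arial", 14, "blue", "Fuji FinePix Z110EXR 14MP"),
    Element("/html/body/ul/li/i", "Arial", 12, "red", "£119.99"),
    Element("/html/body/ul/li/b", "Arial", 14, "blue", "Fujifilm XP30 14MP Waterproof"),
    Element("/html/body/ul/li/i", "Arial", 12, "red", "£129.99"),
    Element("/html/body/ul/li/b", "Arial", 14, "blue", "Samsung ST200F Smart"),
    Element("/html/body/ul/li/i", "Arial", 12, "red", "£111.99"),
]

clusters = cluster_by_appearance(elements)
for key, texts in clusters.items():
    print(key, texts)
```

Exact-match grouping is the simplest possible choice here; a real system would likely need a tolerance for small style differences.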

ClustVX: Second, analyze clusters to identify data records
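A rough illustration of the second step follows. It assumes, purely for the sketch, that clusters holding the same number of items in document order are fields of the same records, so the i-th item of each such cluster belongs to the i-th record. The function name `records_from_clusters` and the `min_repeats` parameter are hypothetical; the slides do not describe the actual analysis at this level of detail.

```python
from collections import Counter


def records_from_clusters(clusters, min_repeats=2):
    """Align equally sized repeating clusters into data records."""
    # Clusters that do not repeat (e.g. a page heading) are not record fields.
    repeated = [c for c in clusters if len(c) >= min_repeats]
    if not repeated:
        return []
    # Assume the most common cluster size equals the number of records.
    record_count = Counter(len(c) for c in repeated).most_common(1)[0][0]
    fields = [c for c in repeated if len(c) == record_count]
    # The i-th item of every field cluster forms the i-th record.
    return [tuple(row) for row in zip(*fields)]


# Assumed sample clusters, each listed in document order.
clusters = [
    ["Cameras"],
    ["Fuji FinePix Z110EXR 14MP", "Fujifilm XP30 14MP Waterproof", "Samsung ST200F Smart"],
    ["562/6283", "559/5101", "559/7635"],
    ["£119.99", "£129.99", "£111.99"],
]

records = records_from_clusters(clusters)
for record in records:
    print(record)
```

This alignment breaks down when a field is missing from some records, which is one reason a real extractor needs more than a size comparison.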

Experiments: Data Sets
To evaluate the ClustVX approach, we use three publicly available benchmark data sets containing 7098 data records in total. These data sets contain web search result pages generated from databases.

Experiments: Evaluation
We use precision and recall, two measures widely used in the information retrieval field, to evaluate the performance of the ClustVX system.
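For reference, precision is the share of extracted records that are correct, and recall is the share of true records that were extracted. The sketch below computes both for sets of record identifiers; the function name and the sample identifiers are made up for illustration and are not results from the experiments.

```python
def precision_recall(extracted, ground_truth):
    """Return (precision, recall) for two sets of data records."""
    correct = len(extracted & ground_truth)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Assumed toy example: three of four extracted records are correct,
# and one of the four true records was missed.
extracted = {"r1", "r2", "r3", "x"}
ground_truth = {"r1", "r2", "r3", "r4"}

precision, recall = precision_recall(extracted, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```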

Experiments: Results
We compare the evaluation results of the ClustVX system with those of other state-of-the-art automatic structured web data extraction systems. As shown in the following table, where the best results are marked in bold, ClustVX consistently outperforms the other approaches.

Conclusions
We presented the ClustVX system, which extracts structured data by exploiting visual and structural features of web page elements. The preliminary evaluation of ClustVX on three publicly available benchmark data sets demonstrated that our method can achieve very high precision and recall. Our future work will concentrate on creating a new, much larger benchmark data set to test the applicability of this system in real-world settings.

Thank you. Questions?
