Webpage Understanding: an Integrated Approach

Slides:



Advertisements
Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Advertisements

Automatic Timeline Generation from News Articles Josh Taylor and Jessica Jenkins.
Linh Harvesting useful data from researchers’ homepages.
1/1/ A Knowledge-based Approach to Citation Extraction Min-Yuh Day 1,2, Tzong-Han Tsai 1,3, Cheng-Lung Sung 1, Cheng-Wei Lee 1, Shih-Hung Wu 4, Chorng-Shyong.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites Center for E-Business Technology Seoul National University.
ELPUB 2006 June Bansko Bulgaria1 Automated Building of OAI Compliant Repository from Legacy Collection Kurt Maly Department of Computer.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Jun Zhu Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint.
Person Name Disambiguation by Bootstrapping Presenter: Lijie Zhang Advisor: Weining Zhang.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
1 Block-based Web Search Deng Cai *1, Shipeng Yu *2, Ji-Rong Wen * and Wei-Ying Ma * * Microsoft Research Asia 1 Tsinghua University 2 University of Munich.
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Xyleme A Dynamic Warehouse for XML Data of the Web.
Aki Hecht Seminar in Databases (236826) January 2009
Image Search Presented by: Samantha Mahindrakar Diti Gandhi.
Traditional Information Extraction -- Summary CS652 Spring 2004.
Visualization of AAG Paper Abstracts André Skupin Dept. of Geography University of New Orleans AAG Pittsburgh, April 5, 2000.
Adapted by Doug Downey from Machine Learning EECS 349, Bryan Pardo Machine Learning Clustering.
Xiaomeng Su & Jon Atle Gulla Dept. of Computer and Information Science Norwegian University of Science and Technology Trondheim Norway June 2004 Semantic.
Retrieving Location-based Data on the Web Andrei Tabarcea,
1 A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search 1 Jie Tang, 2 Ruoming Jin, and 1 Jing Zhang 1 Knowledge.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Towards Automatic Structured Web Data Extraction System Tomas Grigalis, 2nd year PhD student Scientific supervisor: prof. habil. dr. Antanas Čenys.
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Learning Object Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel.
Analysis of DOM Structures for Site-Level Template Extraction (PSI 2015) Joint work done in colaboration with Julián Alarte, Josep Silva, Salvador Tamarit.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Researcher affiliation extraction from homepages I. Nagy, R. Farkas, M. Jelasity University of Szeged, Hungary.
Shared Memory Parallelization of Decision Tree Construction Using a General Middleware Ruoming Jin Gagan Agrawal Department of Computer and Information.
Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.
User Behavior Analysis of Location Aware Search Engine Third international Conference of MDM, 2002 Takahiko Shintani, Iko Pramudiono NTT Information Sharing.
FREERIDE: System Support for High Performance Data Mining Ruoming Jin Leo Glimcher Xuan Zhang Ge Yang Gagan Agrawal Department of Computer and Information.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Algorithmic Detection of Semantic Similarity WWW 2005.
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Juan D.Velasquez Richard Weber Hiroshi Yasuda 國立雲林科技大學 National.
Cold Start Problem in Movie Recommendation JIANG CAIGAO, WANG WEIYAN Group 20.
School of Computer Science 1 Information Extraction with HMM Structures Learned by Stochastic Optimization Dayne Freitag and Andrew McCallum Presented.
Threshold Setting and Performance Monitoring for Novel Text Mining Wenyin Tang and Flora S. Tsai School of Electrical and Electronic Engineering Nanyang.
An Introduction Student Name: Riaz Ahmad Program: MSIT( ) Subject: Data warehouse & Data Mining.
Data Mining and Decision Support
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
TRANS: T ransportation R esearch A nalysis using N LP Technique S Hyoungtae Cho, Melissa Egan, Ferhan Ture Final Presentation December 9, 2009.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
1 Object-Level Vertical Search CIDR, Jan 9, 2007 Zaiqing Nie Microsoft Research Asia With Ji-Rong Wen and Wei-Ying Ma.
Meta-Path-Based Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation By: Xiaozhong Liu, Yingying Yu, Chun Guo, Yizhou.
PAIR project progress report Yi-Ting Chou Shui-Lung Chuang Xuanhui Wang.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Information Extractors Hassan A. Sleiman. Author Cuba Spain Lebanon.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Experience Report: System Log Analysis for Anomaly Detection
Based on Menu Information
Web Data Extraction Based on Partial Tree Alignment
Data Mining: Concepts and Techniques Course Outline
Web Page Cleaning for Web Mining
Panagiotis G. Ipeirotis Luis Gravano
Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

Webpage Understanding: an Integrated Approach Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, and Hsiao-Wuen Hon Jun Zhu Dept. of Comp. Sci. & Tech., Tsinghua University

Integrated webpage understanding model Outline Motivating Examples Windows Live Product Search Libra Academic Search Integrated webpage understanding model Statistical structure mining model Integrated structure mining and text processing model Experiments Conclusions Here the contents of this talk. I will start with a real-world application of Windows Live Product Search to introduce the idea of Web object extraction. And I also discuss the existing decoupled record detection and attribute labeling methods and show their disadvantages. Then I will talk about our proposed simultaneous record detection and attribute labeling. I will also show some experiments.

Motivating Examples Windows Live Product Search (http://products.live.com) A single extractor for all product webpages Template-Independent Libra Academic Search (http://libra.msra.cn) Extraction of web objects like Authors, Papers and Conferences is one key technique More examples: CiteSeer, Google Scholar, Rexa… Web data extraction is an information extraction (IE) task that identifies information of interest from webpages. For example, in David’s homepage we can find his name, title, affiliation, and published papers. Another example is the product information on the web for sale. Identify this information is apparently of great significance for data mining and information retrieval. However, several characteristics make this task challenging. On webpages, the data have more structures but less grammars. For example, the data have 2D spatial information like size and position. The text contents are always text fragments and are lack of natural language processing grammars. Also, the large-scale and heterogeneous webpages require that the methods must be scalable enough and easily adapted. Template-dependent methods cannot be applied because they heavily rely on the design patterns or templates. Existing work on template-independent methods include the partial tree alignment algorithm of Zhai in 2005. And our work in ICML 05, KDD’06 and KDD’07. Let’s talk a little more about these methods.

Characteristics of Webpage Huge number, Heterogeneous Some structures Hierarchical HTML tag tree Visually two-dimensional layout Related information clusters Special formats (like table, list, and etc.) Lack of grammars Many text contents are fragments Lack of the natural language grammars Attribute missing Before going ahead, let’s see what characteristics of the webpages make the problem non-trival. First, the number is huge, and the data are heterogeneous. So, it’s not easy to develop a general model. Second, the data on the web do have some structures: for example,

Tasks of Web Data Extraction Structure Mining: Data Record Detection Group the related information about the same object together, and distinguish unrelated information Informative Block Detection An informative block can contains interested information about several objects Data Field Extraction Identify HTML elements as attribute values Text content processing Identify purified target information Go into HTML elements: Tokens are atomic units Generally, web data extraciton can be grouped as two types. The first is structure mining. Typical tasks include data record detection, informative block detection, and data fields extraction. Data record detection is to group the related information about the same object together. For example, this is a paper record. Informative block detection is like record detection. An informative block can contains several data records and other interested information. Data field extraction is performed at the HTML element level. It identifies the element values. The second type of web data extraction is text content processing. In structure mining, HTML elements are atomic extraction units, however, on webpages, an HTML element can contains multiple attributes. E.g. this element contains three authors. Text processing is to identify each author name. Author Authors

In current, there is no integrated structure mining and text content processing model

Existing Attempts – De-coupled Approaches Learning to Extract Text-based Information from the World Wide Web. KDD’97 Heuristic-based pre-processing Use page layout cues to parse web pages into logically coherent segments NLP flat text extraction CRYSTAL NLP system Using HTML Formatting to Aid in Natural Language Processing on the World Wide Web. CMU Senior Honors Thesis’98 Construct a struct tree for an input web page Extraction rules are coded based on the tree representation To adopt NLP systems to process texts on Web, traditional methods use some pre-processing to parse Web data. Here are two examples……

De-coupled approach is ineffective [Zhu et al., 2006] Disadvantages Pre-processing is non-trivial because of the diversity of web pages and various domains De-coupled approach is ineffective [Zhu et al., 2006] Not convincing that parsed text fragments have the grammars as required in NLP systems Typical NLP methods ignore structure formats

Why no integrated approach? Much work on page layout and format analysis, but little work on text content processing Wrapper-based methods are difficult to scale Typical natural language processing methods are not directly applicable Fragmental web text contents are lack of grammars Structure mining methods are mainly rule-based while NLP methods are statistical Not easy to merge!

Integrated webpage understanding model Outline Motivations Windows Live Product Search Libra Integrated webpage understanding model Statistical structure mining model Integrated structure mining and text processing model Experiments Conclusions Here the contents of this talk. I will start with a real-world application of Windows Live Product Search to introduce the idea of Web object extraction. And I also discuss the existing decoupled record detection and attribute labeling methods and show their disadvantages. Then I will talk about our proposed simultaneous record detection and attribute labeling. I will also show some experiments.

Statistical Web Structure Mining Model (KDD 2006) Joint Optimization of Record Detection and Labeling of HTML Elements Mutual Enhancement Vision-based Page Representation Vision-Tree Hierarchically segment a webpage according to its Visual Layout Consider both Record Detection and Attribute Labeling as labeling tasks on the vision-tree Hierarchical Conditional Random Fields (HCRFs) Fixed model structure Junction tree algorithm for parameter estimation and labeling The de-coupled methods are far from perfect. For example, the error propagation. Lack of semantics in record detection. Lack of mutual interactions in attribute labeling. Local Markov assumption. To address these problems, we propose the hierarchical model. The model is just an instance of our proposed Dynamic Hierarchical Markov Random Fields. In this model, we can jointly do Web object extraction and overcome all the above limitations. Of course, this model suffers from the blocky artifact issue. Actually this is the main motivation we propose the dynamic model. About this model and also the application, more details can be found in our KDD 2006 paper.

Integrated Webpage Understanding Model Basic Idea For structure mining, we use a full hierarchical CRF For text processing, we use semi-Markov Random Field models Integrated Model Let leaf nodes act as the bridge between two types of models

Factorized Distribution Definitions Variables in HCRFs: Segmentation of leaf nodes: Segment: Segmentation: Vector of segmentation: A valid assignment: Factorized distribution:

Separate Learning The integrated model can be separately learned by MLE Each model has its learning method Setup separate training datasets

Integrated webpage understanding model Outline Motivations Windows Live Product Search Libra Integrated webpage understanding model Statistical structure mining model Integrated structure mining and text processing model Experiments Conclusions Here the contents of this talk. I will start with a real-world application of Windows Live Product Search to introduce the idea of Web object extraction. And I also discuss the existing decoupled record detection and attribute labeling methods and show their disadvantages. Then I will talk about our proposed simultaneous record detection and attribute labeling. I will also show some experiments.

Experiments A New real-world dataset Methods evaluated Job corpus, Address and Paper Corpora contains only text documents and no HTML structures We build a new real-world dataset on researchers’ homepages Researchers’ homepages contain both structure patterns and large amount of text contents needing to be processed 292 homepages are randomly downloaded from the computer science departments of about 10 universities, like CMU, MIT, UC Berkeley and Stanford; also from research labs, like MSR, IBM Research, AT&T Bell lab. 13 attributes are identified: Methods evaluated De-coupled sequential method HCRFs for structure mining, and Semi-CRFs for text segmentation and labeling Integrated method with/without NLP-Chunking features Evaluation Metrics Precision, Recall, F1 on each attribute Here are the setup of our experiments

Extraction Accuracy Integrated webpage understanding model performs better on most of the attributes De-coupled baseline achieves higher recalls on Title, Affiliation, Street , Region, and Zip code but lower precisions due to its failure to consider global spatial information

NP-Chunking Features NP-chunking helps Author significantly, while causes slight decreases on Name, Title and Affiliation We use NP-chunking results to introduce additional features.

Conclusions & Future Work Webpage understanding should consider both structure and text contents Structure mining can help text processing in the integrated extraction model NLP features can help in the integrated model Future Work Study more NLP features Study webpages with more free texts, like news webpages, etc. Ok. Here are the conclusions. First, we propose a Dynamic Hierarchical Markov Random Field model, which is a generalization of Conditional Random Fields to incorporate structure uncertainty in hierarchical modeling. We show that the model has more flexible in encoding evidence, and has more disriminative power. It’s more appropriate for diverse web data extraction. We also show that the approximate learning algorithm is empirically stable and efficient.

Welcome to this evening’s poster session for more discussions. Thank you for your time & attention! Welcome to this evening’s poster session for more discussions. Poster board No.: 12

Text Features Visual Features Database Features NLP Features Features Patterns of the text contents, like contain “,”, or “;’, and etc. Visual Features Position, size, color, and etc. Database Features Matching with DBLP records provides additional cues NLP Features Use shallow NP-Chunking to introduce additional features E.g. a noun phrase is more likely to be class X

Database Features Database features can help in general De-coupled baseline can be improved much with database features Integrated understanding model can perform well even without database features