1 Enabling web browsers to augment web sites’ filtering and sorting functionalities David Huynh · Rob Miller · David Karger MIT Computer Science & Artificial.

Presentation transcript:

1 Enabling web browsers to augment web sites’ filtering and sorting functionalities David Huynh · Rob Miller · David Karger MIT Computer Science & Artificial Intelligence Laboratory UIST 2006 · Montreux, Switzerland

2 Automatic web content scraping (2003 to now)
1. Zhai, Y., and B. Liu. Web data extraction based on partial tree alignment. WWW.
2. Hogue, A., and D. Karger. Thresher: automating the unwrapping of semantic content from the World Wide Web. WWW.
3. Reis, D.C., P.B. Golgher, A.S. Silva, and A.F. Laender. Automatic Web news extraction using tree edit distance. WWW.
4. Lerman, K., L. Getoor, S. Minton, and C. Knoblock. Using the structure of Web sites for automatic segmentation of tables. SIGMOD.
5. Ramaswamy, L., et al. Automatic detection of fragments in dynamically generated web pages. WWW.
6. Wang, J.-Y., and F. Lochovsky. Data extraction and label assignment for Web databases. WWW.
7. Arasu, A., and H. Garcia-Molina. Extracting structured data from Web pages. SIGMOD.
8. Liu, B., R. Grossman, and Y. Zhai. Mining data records in Web pages. SIGKDD 2003.

3 … but no one has tried to put … Automatic structured web content scraping technologies in the hands of end-users

4 … let’s run through a real task … Paperback books published in 2005 or later by John Grisham on Amazon

5 … that was a demo of putting … Automatic structured web content scraping technologies in the hands of end-users

6 Sifter browser extension

7 Outline
Motivations
1. User Interface Design
   Extraction
   Augmentation
2. Extraction Algorithm
Evaluations
1. Extraction Algorithm
2. User Interface Design
Conclusions

8 Motivations
Not all web sites are designed based on task analysis and user analysis.
  Faceted browsing? Maps view? Calendar view?
Features are not implemented consistently across sites.
  Web browsers can provide a unified sorting/filtering interface.
Not all users have exactly the same needs. No site can ever design for all users.
  Each web browser can tailor the experience to its owner.

9 Motivations

11 User Interface Design – Extraction
Web content extraction is a system precondition that users understand poorly.
  If it doesn’t let me do this, …
  If the web site understands that this is the original price ($8.99), …
  If I can see that this is a date (“last Christmas”), …

12 User Interface Design – Extraction
Extraction is lengthy and error-prone.
  We explore the UI's potential even in the face of fragile extraction. This tells us which aspects of extraction should be improved first, and in which ways.
  We minimize the steps required to kick-start extraction, but we give the user a chance to make corrections early.

13 UI Design - Extraction
1st click: preview of results, plus controls for making corrections
2nd click: if all goes well

15 User Interface Design - Augmentation
Novelty
Presentation of data remains unchanged … except for a few asterisks.
  The presentation might be well designed with domain-specific knowledge, and worth keeping as-is.
  Semantics of the data are in the presentation.
  We want to maintain visual context.
Filtering and sorting are supported without resorting to field names.

16 User Interface Design - Augmentation By keeping the original visual presentation of the data, and then applying automatic content extraction technology, we can provide additional functionalities without needing, trying, or pretending to understand the semantics of the data. format? binding? medium? who cares?!

17 … ssshhhh … Semantics is Overrated

19 Extraction Algorithm
Detection of
1. Items of interest
2. Subsequent pages
3. Fields within items

20 Extraction Algorithm - Assumptions
1. Items occupy most of the page area.
2. Each item contains links.
Find THE set of similar links whose outer containers occupy the largest page area compared to other sets of links.
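The selection rule on this slide can be sketched in a few lines. This is only an illustration, not Sifter's code: the `Box` type, the `pick_item_set` helper, the second path, and the pixel sizes are all hypothetical, and it assumes the sets of similar links and the rendered sizes of their containers have already been computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rendered size of one container element, in pixels (hypothetical input)."""
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def pick_item_set(candidates: dict[str, list[Box]]) -> str:
    """Return the key of the link set whose containers cover the most page area."""
    return max(candidates, key=lambda key: sum(box.area for box in candidates[key]))


# Two made-up sets of similar links, keyed by the tag path of their links.
candidates = {
    "BODY/TABLE/TR/TD/DIV/A": [Box(600, 120), Box(600, 120)],  # large result rows
    "BODY/DIV/UL/LI/A": [Box(150, 20)] * 5,                    # small navigation links
}
best = pick_item_set(candidates)
```

Here the two result rows cover far more area than the five navigation links, so the first set is chosen even though it contains fewer links.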

21-33 Extraction Algorithm: worked example (animated diagram, one step per slide)
DOM tree shown on every slide:
BODY
  TABLE
    TR (item 1)
      TD
        DIV
          A
    TR (item 2)
      TD
        DIV
          A
Slides 22 to 28 build the path of a link upward from A:
A
DIV/A
TD/DIV/A
TR/TD/DIV/A
TABLE/TR/TD/DIV/A
BODY/TABLE/TR/TD/DIV/A
BODY/TABLE/TR/TD/DIV/A (Found similar links!)
Slides 29 to 33 climb from the links toward their containers:
BODY/TABLE/TR/TD/DIV/A/..
BODY/TABLE/TR/TD/DIV/A/../..
BODY/TABLE/TR/TD/DIV/A/../../..
BODY/TABLE/TR/TD/DIV/A/../../../..
BODY/TABLE/TR/TD/DIV/A/../../.. (Found one potential set of items!)
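The walk shown on slides 21 to 33 can be approximated as follows. This is a minimal sketch under stated assumptions, not the extension's implementation: `PAGE`, `tag_path`, and `find_item_containers` are hypothetical names, the page is a tiny well-formed stand-in for real HTML, and the largest group of links is picked by count as a stand-in for the page-area ranking the slides describe.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Stand-in for the two-item page drawn on the slides.
PAGE = """<body><table>
<tr><td><div><a href="/item1">Item 1</a></div></td></tr>
<tr><td><div><a href="/item2">Item 2</a></div></td></tr>
</table></body>"""


def tag_path(node, parent_of):
    """Build a path such as BODY/TABLE/TR/TD/DIV/A by walking up to the root."""
    parts = []
    while node is not None:
        parts.append(node.tag.upper())
        node = parent_of.get(node)
    return "/".join(reversed(parts))


def find_item_containers(root):
    """Group links by tag path, then climb until the containers would merge."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    groups = defaultdict(list)
    for link in root.iter("a"):
        groups[tag_path(link, parent_of)].append(link)
    # Assumption: rank groups by link count instead of rendered page area.
    path, containers = max(groups.items(), key=lambda kv: len(kv[1]))
    levels = 0
    while True:
        parents = [parent_of.get(node) for node in containers]
        reached_root = any(p is None for p in parents)
        merged = len({id(p) for p in parents}) < len(parents)
        if reached_root or merged:
            break  # one more step up would no longer give one container per link
        containers = parents
        levels += 1
    return path + "/.." * levels, containers


root = ET.fromstring(PAGE)
item_path, items = find_item_containers(root)
print(item_path, [node.tag for node in items])
```

On this page the climb stops at the two TR elements, because one more step up would land both links on the same TABLE.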

34 Extraction Algorithm – Subsequent page detection

35 Extraction Algorithm – Subsequent page detection URL parameters ?... &page=2&... ?... &page=3&... ?... &page=4&...
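One way to spot such a parameter is to look for a query parameter whose numeric value counts upward across the links on the page. The slide only shows the `page=2`, `page=3`, `page=4` pattern; the consecutive-values heuristic, the `find_page_parameter` helper, and the example URLs below are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlsplit


def find_page_parameter(current_url, link_urls):
    """Return (name, sorted values) for a parameter that looks like a page counter.

    Heuristic (an assumption): among links sharing the current page's path,
    pick the parameter with the longest run of consecutive integer values.
    Returns None when no parameter qualifies.
    """
    base_path = urlsplit(current_url).path
    values = {}
    for url in link_urls:
        parts = urlsplit(url)
        if parts.path != base_path:
            continue  # ignore links that lead to a different kind of page
        for name, value in parse_qsl(parts.query):
            if value.isdecimal():
                values.setdefault(name, set()).add(int(value))
    best = None
    for name, numbers in values.items():
        ordered = sorted(numbers)
        consecutive = len(ordered) >= 2 and all(
            b - a == 1 for a, b in zip(ordered, ordered[1:])
        )
        if consecutive and (best is None or len(ordered) > len(best[1])):
            best = (name, ordered)
    return best


links = [
    "http://example.com/search?q=grisham&page=2",
    "http://example.com/search?q=grisham&page=3",
    "http://example.com/search?q=grisham&page=4",
    "http://example.com/help?id=7",
]
result = find_page_parameter("http://example.com/search?q=grisham&page=1", links)
```

Once the parameter is known, the remaining result pages can be requested by substituting successive values for it.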

37 Evaluations – Extraction algorithm
Test conducted over 30 web sites: Amazon, BestBuy, CNET Reviews, Froogle, Target, Walmart, …
Item detection
  Items on 27 / 30 collections can be identified by XPaths (in the remaining 3, items consist of sibling/cousin nodes) …
  … but only 24 / 27 were automatically detected.
Subsequent page detection
  For 22 / 27 collections, subsequent pages could be identified.
  For 19 / 22 collections, the original numbers of items were recovered.
Overall: 19 / 30 = 63% accuracy
  We measure accuracy at the level of whole collections, not individual items.
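The figures on this slide form a funnel, and the overall number is simply the last numerator over the first denominator. A small sketch that recomputes each rate from the counts as reported (the stage labels are paraphrases, and the variable names are illustrative):

```python
# (stage label, collections passing, collections considered), as reported on the slide
stages = [
    ("items identifiable by XPath", 27, 30),
    ("items automatically detected", 24, 27),
    ("subsequent pages identified", 22, 27),
    ("original item counts recovered", 19, 22),
]

for label, passed, total in stages:
    print(f"{label}: {passed}/{total} = {passed / total:.0%}")

# Overall accuracy counts a collection as a success only if every stage worked.
overall_percent = round(100 * stages[-1][1] / stages[0][2])
print(f"overall: {stages[-1][1]}/{stages[0][2]} = {overall_percent}%")
```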

38 Evaluations – User Interface Design
Extraction algorithm is still fragile
  Formative evaluation of UI
Is “web content extraction” too high a conceptual barrier?
Is in-place sorting/filtering augmentation usable?
  No field names: usable?
Is such augmentation useful?

40 Evaluations – User Interface Design
Task 1: Structured
  This task lets subjects get familiar with the UI. No specific help or tutorial is provided.
  Subject follows a sequence of high-level instructions to ultimately perform a complex query (sort by price, filter by date).
  Subject is given 5 min to perform a similar query using the web site.
Task 2: Unstructured
  Subject judges whether a sale of several products is good.

41 Evaluations – User Interface Design
Task 1: Structured
  8/8 subjects completed the task using our system.
  5/8 … using the web site within 5 minutes.
  1/8 knew about Amazon’s Advanced Search.
  All subjects were familiar with Amazon.
  A unified filtering/sorting UI can be more usable than different UIs on different sites.
Task 2: Unstructured
  7/8 subjects completed the task using our system.
  1 refused to complete the task.

42 Evaluations – UI Design
Survey responses indicate that our system is usable and useful … while it offers advanced functionalities.

43 Conclusions
In our work, we …
  Preserve original presentation to leverage the semantics within it;
  Provide filter/sort functionalities without field names;
  Put automatic web content extraction technologies into the hands of end-users;
  Show evidence that it’s usable and useful.
For future work, we will focus on …
  Error recovery;
  Merging data from several sites.

44 More information
Firefox extension installation file
Open source code + build instructions
Links to video and user study data