Published by Cameron Higgins. Modified over 8 years ago.
1
Developing an Enquirer Carlos Rivero
2
Contents Deep Web Data Islands IntegraWeb Conclusions
3
Surface Web “Is that portion of the World Wide Web that is indexed by conventional search engines” Wikipedia
4
Deep Web “Refers to World Wide Web content that is not part of the surface Web indexed by search engines” Wikipedia
5
Deep Web in Google http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html
6
“I think the Deep Web is…” http://www.brightplanet.com/images/stories/pdf/deepwebwhitepaper.pdf
7
“I think the Deep Web is…” HePZC07
8
Our objective SQL
9
Contents Deep Web Data Islands IntegraWeb Conclusions
10
Web data islands “Deep Web Applications”: User friendly, Very interactive. Services on the web: Loads of services, Loads of providers. [Figure: browser window at http://www.books.com showing a search form with Keywords and Price fields (1-10 €, 10-20 €, Don’t care) and a Search! button]
11
A perfect scenario User Interface, Controller, Business Logic, Data Access Layer, Data Layer. Wrapping (Protocol-Oriented), Specific-purpose API, Generic API (JDBC, XML, …)
12
A typical scenario User Interface, Business Logic, Data Access Layer, Data Layer. [Figure: browser window at http://www.books.com showing a search form with Keywords and Price fields (1-10 €, 10-20 €, Don’t care) and a Search! button]
13
Non-dismantleable Applications “Typical scenario” applications Monolithic Reengineering is not possible “Perfect scenario” applications Reengineering is not affordable
14
Contents Deep Web Data Islands IntegraWeb Conclusions
15
IntegraWeb in a nutshell Focus Web data islands Non-dismantleable Goal Help integrate them Help make adapters
16
The IntegraWeb Architecture: Information retrieval, Extractor, Ontologiser, Verifier, Knowledge Base, Ontology, Dataset
18
Information retrieval [Diagram. Setup: Search Form, Form Analyzer, Form Model. Query Executor: Query, Query analyzer ([Feasible] / [Not Feasible]), Filler with *(att,val), Navigator, Result page] Form analyzer: Bayesian, Visual, Hand-crafted, Others. Query analyzer: Views, Feasibility. Filler: Testers. Navigator: EzBuilder
19
Example
Query: select * from Books where Title like “%Vinci%”
Query analyzer, Filler: form “Welcome to books.com!” with Title: Vinci, Price: ( ) 1-10 € ( ) 10-20 € (X) Don’t care, Search!
Navigator: Result list 1. The Da Vinci Code 2. 21 Da Vinci Code… 3. ... Next >>
Result: The Da Vinci Code, Dan Brown, Doubleday, 2006, 15.95 €. Robert Langdon is a Harvard Professor of Symbology… [Buy]
Result: 21 Da Vinci Code Myths, No author, The Focal, 2006, 19.95 €. Few books have caused a stir like The Da Vinci Code… [Buy]
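The filler step of this example can be sketched in a few lines of Python. This is only an illustration: the form field names (`title`, `price`), the default value `any` for "Don’t care", and the action URL `http://www.books.com/search` are assumptions, not details taken from the slides.

```python
from urllib.parse import urlencode

# Hypothetical mapping from query attributes to form field names.
FIELD_FOR_ATTRIBUTE = {"Title": "title", "Price": "price"}
# Hypothetical default: leave the price filter on "Don't care".
DEFAULTS = {"price": "any"}


def fill_form(conditions, action_url):
    """Turn query conditions into (att, val) pairs and a GET request URL."""
    params = dict(DEFAULTS)
    for attribute, operator, value in conditions:
        if operator == "like":
            # A pattern such as %Vinci% becomes the keyword Vinci.
            value = value.strip("%")
        params[FIELD_FOR_ATTRIBUTE[attribute]] = value
    return action_url + "?" + urlencode(params)


url = fill_form([("Title", "like", "%Vinci%")], "http://www.books.com/search")
print(url)
```

A real filler would drive the form through a browser or testing tool instead of building the URL by hand, since many forms use POST requests or scripts.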
20
Form analyzer Probabilistic approach (Kushmerick03) Three-layer Bayesian network. Domains: Pr[SearchBook]>>Pr[FindCollege] Datatypes: Pr[BookTitle|SearchBook] >> Pr[DestAirport|SearchBook] Terms: Pr[title|BookTitle] >> Pr[city|BookTitle] Search form classifier Terms annotation
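A toy version of the three-layer idea (domain, datatype, terms) might look as follows. All probabilities, the `CollegeName` datatype and the function names are made up for illustration; Kushmerick03 learns such parameters from labelled forms, and this sketch does not reproduce that training.

```python
import math

# Hypothetical, hand-picked probabilities (not learned from data).
DOMAIN_PRIOR = {"SearchBook": 0.7, "FindCollege": 0.3}
DATATYPE_GIVEN_DOMAIN = {
    "SearchBook": {"BookTitle": 0.5, "BookPrice": 0.4, "CollegeName": 0.1},
    "FindCollege": {"BookTitle": 0.1, "BookPrice": 0.1, "CollegeName": 0.8},
}
TERM_GIVEN_DATATYPE = {
    "BookTitle": {"title": 0.6, "price": 0.05, "city": 0.05, "name": 0.3},
    "BookPrice": {"title": 0.05, "price": 0.8, "city": 0.05, "name": 0.1},
    "CollegeName": {"title": 0.1, "price": 0.05, "city": 0.35, "name": 0.5},
}
UNSEEN = 1e-3  # probability assigned to terms missing from the tables


def term_likelihood(terms, datatype):
    """Pr[terms | datatype], treating terms as independent."""
    table = TERM_GIVEN_DATATYPE[datatype]
    return math.prod(table.get(term, UNSEEN) for term in terms)


def classify_field(terms, domain):
    """Most probable datatype for one field, given the form's domain."""
    datatypes = DATATYPE_GIVEN_DOMAIN[domain]
    return max(datatypes, key=lambda d: datatypes[d] * term_likelihood(terms, d))


def classify_form(fields):
    """Most probable domain for a form; fields is a list of term lists."""
    def score(domain):
        total = math.log(DOMAIN_PRIOR[domain])
        for terms in fields:
            # Sum out the unknown datatype of each field.
            total += math.log(sum(
                p * term_likelihood(terms, d)
                for d, p in DATATYPE_GIVEN_DOMAIN[domain].items()))
        return total
    return max(DOMAIN_PRIOR, key=score)


form = [["title"], ["price"]]
domain = classify_form(form)
print(domain, [classify_field(terms, domain) for terms in form])
```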
21
Form analyzer [Figure: the books.com search form (Title, Price) goes through a Tokenizer (Welcome to books.com, Title:, …) and a Bayesian Classifier made of a Domain Classifier (SearchBook) and a Datatype Classifier (Title: BookTitle, Price: BookPrice, …)]
22
Form analyzer Proximity-based approach (Alvarez07) Heuristic (predefined) Visual distance Textual distance
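The visual-distance heuristic can be illustrated with a nearest-label assignment. This is a simplified sketch, not the Alvarez07 algorithm: it assumes each label and field comes with a bounding box (the `Box` type and all coordinates are hypothetical) and it ignores textual distance.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """A rendered element: its text (or field name) and bounding box."""
    text: str
    x: float
    y: float
    width: float
    height: float

    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)


def visual_distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return math.hypot(ax - bx, ay - by)


def assign_labels(labels, fields):
    """Map each field name to the text of its visually closest label."""
    return {
        field.text: min(labels, key=lambda lab: visual_distance(lab, field)).text
        for field in fields
    }


labels = [Box("Title:", 10, 10, 40, 12), Box("Price:", 10, 40, 40, 12)]
fields = [Box("title_input", 60, 10, 100, 12), Box("price_radio", 60, 40, 100, 12)]
print(assign_labels(labels, fields))
```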
23
Form analyzer
25
Proximity-based approach (Chang04) 2P-grammar (user-defined) Best-effort parser
26
Form analyzer [Figure: the books.com search form (Title, Price) goes through a Tokenizer (Token 1, Token 2, …), a Best-effort Parser that uses the 2P-Grammar, and a Merger, which yields the Query capabilities]
27
Form analyzer Tokenization
28
Form analyzer 2P-Grammar & Best-Effort Parser
29
Form analyzer
30
Modeling manually (Pan02) Useful to check feasibility
TITLE: String, AUTHOR: String, FORMAT: Enumerated(Hardcover, Paperback, eBooks), PUBLISHER: String, PRICE: Money
NEGATIVE { (TITLE, Contains, 1, ANY) }
POSITIVE { (TITLE, Contains, +, ANY) (AUTHOR, Contains, +, ANY) (FORMAT, =, 1, {‘Hardcover’, ‘Paperback’, ‘eBooks’}) (PRICE, <=, 1, ANY) (PRICE, >=, 1, ANY) }
OUTPUT { TITLE, AUTHOR, FORMAT, PUBLISHER, PRICE }
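A feasibility check against such a hand-crafted model could be sketched as below. The reading of the templates is an assumption: each POSITIVE entry is taken as (attribute, operator, cardinality, allowed values), with cardinality `1` meaning at most one condition and `+` meaning any number. The NEGATIVE part is ignored because the slide does not explain its semantics, and `is_feasible` is a hypothetical name.

```python
from collections import Counter

ANY = "ANY"

# POSITIVE and OUTPUT sets copied from the model above.
POSITIVE = [
    ("TITLE", "Contains", "+", ANY),
    ("AUTHOR", "Contains", "+", ANY),
    ("FORMAT", "=", "1", {"Hardcover", "Paperback", "eBooks"}),
    ("PRICE", "<=", "1", ANY),
    ("PRICE", ">=", "1", ANY),
]
OUTPUT = {"TITLE", "AUTHOR", "FORMAT", "PUBLISHER", "PRICE"}


def is_feasible(conditions, projection, positive=POSITIVE, output=OUTPUT):
    """Check whether a query fits the form model.

    conditions: list of (attribute, operator, value) triples.
    projection: attributes the query wants returned.
    """
    templates = {(attr, op): (card, vals) for attr, op, card, vals in positive}
    counts = Counter()
    for attr, op, value in conditions:
        if (attr, op) not in templates:
            return False  # the form offers no such field/operator
        card, vals = templates[(attr, op)]
        if vals != ANY and value not in vals:
            return False  # value outside the enumerated options
        counts[(attr, op)] += 1
        if card == "1" and counts[(attr, op)] > 1:
            return False  # the field accepts a single value only
    return set(projection) <= output


print(is_feasible([("TITLE", "Contains", "Vinci")], ["TITLE", "PRICE"]))
print(is_feasible([("PUBLISHER", "Contains", "Doubleday")], ["TITLE"]))
```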
31
Other proposals LageSGL04: Textual distance. RaghavanG01: DOM tree. HeMYW03: Heuristic + Bayesian network. ModicaGJ01: Ontologies
32
Query analyzer Search forms are virtual views V1, …, Vn Feasibility Is it possible to answer Q using only V1, …, Vn? Papakonstantinou06 RED, YELLOW, BLUE and WHITE Guides user towards the formulation of feasible queries
33
Query analyzer
37
Fillers and navigators Fillers: Web application testers (Watij, HTTPUnit, JMeter, …) Fillers & Navigators: EZBuilder (www.fetch.com) 1. The user demonstrates how to obtain the required information 2. The tool constructs a model of the site 3. The tool generalizes the learned model to automatically download detail pages
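What a navigator does once a site model exists can be sketched with the Python standard library. This is a hand-coded illustration, not EZBuilder's learning-by-demonstration: the link text `Next >>` comes from the earlier example slide, while the `fetch` callback, the page URLs and the rule "every other link is a detail page" are assumptions.

```python
from html.parser import HTMLParser


class ResultPageParser(HTMLParser):
    """Collects (href, link text) pairs from the anchors of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_detail_links(fetch, start_url, next_text="Next >>"):
    """Follow the 'next page' link and gather all other links as detail pages."""
    details, url, seen = [], start_url, set()
    while url and url not in seen:
        seen.add(url)
        parser = ResultPageParser()
        parser.feed(fetch(url))
        url = None
        for href, text in parser.links:
            if text == next_text:
                url = href  # continue with the next result page
            else:
                details.append(href)
    return details


# Hypothetical in-memory site standing in for real HTTP requests.
PAGES = {
    "/results?page=1": '<a href="/book/1">The Da Vinci Code</a>'
                       '<a href="/book/2">21 Da Vinci Code Myths</a>'
                       '<a href="/results?page=2">Next &gt;&gt;</a>',
    "/results?page=2": '<a href="/book/3">Third result</a>',
}
print(collect_detail_links(PAGES.__getitem__, "/results?page=1"))
```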
38
Contents Deep Web Data Islands IntegraWeb Conclusions
39
Comparative
Proposals: Kushmerick03, Alvarez07, Chang04, Pan02
Core: Bayesian network (Kushmerick03), Visual heuristic (Alvarez07), Visual grammar + parser (Chang04), Hand-crafted (Pan02)
Criteria: Form field labels, Form field semantics, Operators, Checking feasibility
40
Comparative
Proposals: Kushmerick03, Alvarez07, Chang04, Pan02
Criteria: Fields dependency, Form steps, Mandatory fields
41
Future work Web form model Rich & useful Form analyzer Hand-crafted Automatic Query analyzer Transform form into a view
42
Thanks! Questions?