Developing an Enquirer Carlos Rivero. Contents Deep Web Data Islands IntegraWeb Conclusions.


1 Developing an Enquirer Carlos Rivero

2 Contents Deep Web Data Islands IntegraWeb Conclusions

3 Surface Web “Is that portion of the World Wide Web that is indexed by conventional search engines” Wikipedia

4 Deep Web “Refers to World Wide Web content that is not part of the surface Web indexed by search engines” Wikipedia

5 Deep Web in Google http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html

6 “I think the Deep Web is…” http://www.brightplanet.com/images/stories/pdf/deepwebwhitepaper.pdf

7 “I think the Deep Web is…” HePZC07

8 Our objective SQL

9 Contents Deep Web Data Islands IntegraWeb Conclusions

10 Web data islands  “Deep Web Applications”  User friendly  Very interactive  Services on the web  Loads of services  Loads of providers [Figure: browser window showing a search form at http://www.books.com with a Keywords field, Price options (1-10 €, 10-20 €, Don’t care) and a Search! button]

11 A perfect scenario User Interface Controller Business Logic Data Access Layer Data Layer Wrapping (Protocol-Oriented) Specific-purpose API Generic API (JDBC, XML, …)

12 A typical scenario User Interface Business Logic Data Access Layer Data Layer [Figure: browser window showing the search form at http://www.books.com]

13 Non-dismantleable Applications  “Typical scenario” applications  Monolithic  Reengineering is not possible  “Perfect scenario” applications  Reengineering is not affordable

14 Contents Deep Web Data Islands IntegraWeb Conclusions

15 IntegraWeb in a nutshell  Focus  Web data islands  Non-dismantleable  Goal  Help integrate them  Help make adapters

16 The IntegraWeb Architecture [Diagram components: Verifier, Ontologiser, Knowledge Base, Extractor, Information retrieval, Ontology, Dataset]

17 The IntegraWeb Architecture [Diagram components: Verifier, Ontologiser, Knowledge Base, Extractor, Information retrieval, Ontology, Dataset]

18 Query Executor Setup Information retrieval Query analyzer Query Search Form Filler [Feasible] [Not Feasible] *(att,val) Result page Navigator Form Analyzer Form Model  Form analyzer  Bayesian  Visual  Hand-crafted  Others  Query analyzer  Views  Feasibility  Filler  Testers  Navigator  EzBuilder

19 Example Title: Price: Welcome to books.com! Vinci Search! ( ) 1-10 € (·) 10-20 € (X) Don’t care select * from Books where Title like “%Vinci%” The Da Vinci Code Buy Dan Brown Doubleday, 2006 15.95 € Robert Langdon is a Harvard Professor of Symbology… 21 Da Vinci Code Myths Buy No author The Focal, 2006 19.95 € Few books have caused a stir like The Da Vinci Code… Query analyzer Filler Navigator Result list Next >> 1. The Da Vinci Code 2. 21 Da Vinci Code… 3....
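
The step from the query to the values typed into the form can be sketched in Python. This is only an illustration under stated assumptions: the FORM_DEFAULTS model, the function name query_to_assignments and the regex handling of `like` conditions are hypothetical and are not taken from IntegraWeb.

```python
import re

# Hypothetical form model for the books.com example: field name -> default value.
FORM_DEFAULTS = {"Title": "", "Price": "Don't care"}

# Matches conditions such as: Title like "%Vinci%"
LIKE_PATTERN = re.compile(r'(\w+)\s+like\s+"%?([^"%]*)%?"', re.IGNORECASE)


def query_to_assignments(sql):
    """Return sorted (att, val) pairs to type into the form, or None if not feasible."""
    assignments = dict(FORM_DEFAULTS)
    parts = re.split(r"\bwhere\b", sql, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2:
        for att, val in LIKE_PATTERN.findall(parts[1]):
            if att not in assignments:
                return None  # the form has no such field, so the query is not feasible
            assignments[att] = val
    return sorted(assignments.items())
```

Calling query_to_assignments('select * from Books where Title like "%Vinci%"') fills Title with Vinci and leaves Price at its default, which mirrors the form state shown on the slide.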

20 Form analyzer  Probabilistic approach (Kushmerick03)  Three-layer Bayesian network.  Domains: Pr[SearchBook]>>Pr[FindCollege]  Datatypes: Pr[BookTitle|SearchBook] >> Pr[DestAirport|SearchBook]  Terms: Pr[title|BookTitle] >> Pr[city|BookTitle]  Search form classifier  Terms annotation
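
A rough Python sketch of the three layers (domains, datatypes, terms) follows. It scores each field label independently, which is a simplification of the Bayesian network in Kushmerick03, and every probability below is a hand-picked, hypothetical value rather than a learned one.

```python
import math

# Hypothetical probabilities; a real system would estimate them from labelled forms.
P_DOMAIN = {"SearchBook": 0.7, "FindCollege": 0.3}
P_DATATYPE = {
    "SearchBook": {"BookTitle": 0.5, "BookPrice": 0.5},
    "FindCollege": {"CollegeName": 0.6, "City": 0.4},
}
P_TERM = {
    "BookTitle": {"title": 0.8, "price": 0.05},
    "BookPrice": {"price": 0.8, "title": 0.05},
    "CollegeName": {"college": 0.7, "name": 0.2},
    "City": {"city": 0.8},
}
UNSEEN = 0.01  # smoothing value for terms with no listed probability


def score(domain, datatype, terms):
    """Log of Pr[domain] * Pr[datatype | domain] * product of Pr[term | datatype]."""
    log_p = math.log(P_DOMAIN[domain]) + math.log(P_DATATYPE[domain][datatype])
    for term in terms:
        log_p += math.log(P_TERM[datatype].get(term, UNSEEN))
    return log_p


def annotate(terms):
    """Return the most probable (domain, datatype) pair for the terms of one label."""
    candidates = [
        (score(domain, datatype, terms), domain, datatype)
        for domain in P_DOMAIN
        for datatype in P_DATATYPE[domain]
    ]
    _, domain, datatype = max(candidates)
    return domain, datatype
```

With these numbers, the label term "title" is annotated as BookTitle in the SearchBook domain, in line with Pr[title|BookTitle] >> Pr[city|BookTitle] on the slide.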

21 Form analyzer Title: Price: Welcome to books.com! Vinci Search! ( ) 1-10 € (·) 10-20 € (X) Don’t care Welcome to books.com Title: … Tokenizer Title: Price: … Bayesian Classifier Domain Classifier Datatype Classifier SearchBook Title: Price: BookTitle BookPrice …

22 Form analyzer  Proximity-based approach (Alvarez07)  Heuristic (predefined)  Visual distance  Textual distance
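
The visual-distance part of such a heuristic might look like the Python sketch below. It assumes each label and field has already been reduced to a single (x, y) position; the Box class and nearest_label function are hypothetical, and the textual-distance heuristic of Alvarez07 is not modelled.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """A rendered element reduced to a name and the position of its centre."""
    name: str
    x: float
    y: float


def nearest_label(field, labels):
    """Return the name of the label that is visually closest to the field."""
    closest = min(labels, key=lambda label: math.hypot(field.x - label.x, field.y - label.y))
    return closest.name
```

For a form with a "Title:" label at (0, 0) and a "Price:" label at (0, 30), a text box at (50, 0) is paired with "Title:".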

23 Form analyzer

24

25  Proximity-based approach (Chang04)  2P-grammar (user-defined)  Best-effort parser

26 Form analyzer Title: Price: Welcome to books.com! Vinci Search! ( ) 1-10 € (·) 10-20 € (X) Don’t care Welcome to books.com Title: … Tokenizer Token 1 Token 2 … Best-effort Parser Merger Query capabilities 2P-Grammar

27 Form analyzer  Tokenization

28 Form analyzer  2P-Grammar & Best-Effort Parser

29 Form analyzer

30  Modeling manually (Pan02)  Useful to check feasibility
    TITLE: String, AUTHOR: String, FORMAT: Enumerated(Hardcover, Paperback, eBooks), PUBLISHER: String, PRICE: Money
    NEGATIVE { (TITLE, Contains, 1, ANY) }
    POSITIVE { (TITLE, Contains, +, ANY) (AUTHOR, Contains, +, ANY) (FORMAT, =, 1, {‘Hardcover’, ‘Paperback’, ‘eBooks’}) (PRICE, <=, 1, ANY) (PRICE, >=, 1, ANY) }
    OUTPUT { TITLE, AUTHOR, FORMAT, PUBLISHER, PRICE }
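
A feasibility check against the POSITIVE part of this model could be sketched in Python as follows. The reading of the tuples is an assumption: each one is taken as (attribute, operator, cardinality, allowed values), with cardinality 1 meaning at most one such condition and + meaning any number. The NEGATIVE part is left out because its semantics are not stated on the slide, and the is_feasible name is hypothetical.

```python
ANY = None  # stands for the ANY value domain

# (attribute, operator) -> (cardinality, allowed values), copied from the POSITIVE block.
POSITIVE = {
    ("TITLE", "Contains"): ("+", ANY),
    ("AUTHOR", "Contains"): ("+", ANY),
    ("FORMAT", "="): ("1", {"Hardcover", "Paperback", "eBooks"}),
    ("PRICE", "<="): ("1", ANY),
    ("PRICE", ">="): ("1", ANY),
}


def is_feasible(conditions):
    """Check a list of (attribute, operator, value) conditions against POSITIVE."""
    counts = {}
    for att, op, val in conditions:
        key = (att, op)
        if key not in POSITIVE:
            return False  # the form offers no such condition
        cardinality, allowed = POSITIVE[key]
        if allowed is not ANY and val not in allowed:
            return False  # value outside the enumerated domain
        counts[key] = counts.get(key, 0) + 1
        if cardinality == "1" and counts[key] > 1:
            return False  # assumed meaning of cardinality 1: at most one condition
    return True
```

Under this reading, a query on TITLE and a price range is feasible, while a condition on PUBLISHER is not, since PUBLISHER appears only in OUTPUT.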

31 Other proposals  LageSGL04  Textual distance  RaghavanG01  DOM tree  HeMYW03  Heuristic + Bayesian network  ModicaGJ01  Ontologies

32 Query analyzer  Search forms are virtual views V1, …, Vn  Feasibility  Is it possible to answer Q using only V1, …, Vn?  Papakonstantinou06  RED, YELLOW, BLUE and WHITE  Guides user towards the formulation of feasible queries
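
The feasibility question can be illustrated with a deliberately simple Python sketch. It is not the Papakonstantinou06 approach: it only asks whether a single view can filter on and return the attributes a query needs, and it ignores combinations of views. The VIEWS contents and the feasible_views name are hypothetical.

```python
# Hypothetical views: each search form is described by the attributes it can
# filter on and the attributes it returns.
VIEWS = {
    "V1": {"filters": {"Title", "Price"}, "outputs": {"Title", "Author", "Price"}},
    "V2": {"filters": {"Author"}, "outputs": {"Title", "Author"}},
}


def feasible_views(filter_atts, output_atts):
    """Return the names of the views that can answer the query on their own."""
    return sorted(
        name
        for name, view in VIEWS.items()
        if set(filter_atts) <= view["filters"] and set(output_atts) <= view["outputs"]
    )
```

An empty result means that no single view answers the query, which is the case a query analyzer would report back to the user.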

33 Query analyzer

34

35

36

37 Fillers and navigators  Fillers: Web application testers (Watij, HTTPUnit, JMeter, …)  Fillers & Navigators: EZBuilder (www.fetch.com) 1. The user demonstrates how to obtain the required information 2. EZBuilder constructs a model of the site 3. It generalizes the learned model to automatically download detail pages
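
For a form submitted with HTTP GET, the filler's job reduces to encoding the (att, val) pairs into a request URL. The Python sketch below uses only the standard library rather than any of the tools named above; the /search path and the field names are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(action_url, assignments):
    """Encode (att, val) pairs as the query string of a GET form submission."""
    return action_url + "?" + urlencode(assignments)


# Hypothetical form action and field names for the books.com example.
url = build_search_url("http://www.books.com/search", [("title", "Vinci"), ("price", "any")])
```

A navigator would then fetch this URL and follow the result list and its Next >> links.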

38 Contents Deep Web Data Islands IntegraWeb Conclusions

39 Comparative
     | Kushmerick03 | Alvarez07 | Chang04 | Pan02
Core | Bayesian network | Visual heuristic | Visual grammar + parser | Hand-crafted
Other criteria compared: Form field labels, Form field semantics, Operators, Checking feasibility

40 Comparative
Kushmerick03 | Alvarez07 | Chang04 | Pan02
Other criteria compared: Fields dependency, Form steps, Mandatory fields

41 Future work  Web form model  Rich & useful  Form analyzer  Hand-crafted  Automatic  Query analyzer  Transform form into a view

42 Thanks! Questions?

