Published by Lynette Ophelia Sherman. Modified over 9 years ago.
1
Data Extraction and Integration from Imprecise Web Sources
Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre
(Creative Commons License, see last slide)
2
Data-intensive websites
3
Data-intensive websites
[Figure labels: Website, Database, Template1, Template2, Template3, target]
4
Flint goal
…
StockQuote: Last, Min, Max, Volume, 52high, Open
5
Flint System architecture
[Figure labels: Web Search [WIDM08], Data Extraction, Data Integration, The Web]
6
Novel contribution
Data Extraction: Unsupervised, Automatic, Scalable, No knowledge available
  Cited work: RoadRunner [Vldb01], ExAlg [Sigmod03], TurboWrapper [Vldb07]
Data Integration: Unsupervised, Automatic, Scalable, Uncertain Data, No labels available, No corpus available
  Cited work: WebTables [Vldb08], Cimple [Vldb07], MetaQuerier [Cidr05], PayGo [Cidr07]
7
Data Extraction
9
AAPL, GOOG, MSFT, INTC, …
128.09, 439.54, 34.89, 112.37, …
127.81, 439.25, 32.13, 111.01, …
132.43, 443.82, 33.67, 114.32, …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, …
…
10
Data Extraction
HTML fragments taken from two pages belonging to the same website:

Values (page 1, page 2)    XPath rule                      Note
1,132,228, 1,735,857       /html/body/table/tr[1]/td[2]
$20.66, $414.58            /html/body/table/tr[2]/td[2]
$11.70, $247.30            /html/body/table/tr[3]/td[2]
$20.72, $414.06            /html/body/table/tr[4]/td[2]
$0.02, 99,494,200          /html/body/table/tr[5]/td[2]    Extraction error!
4,732,600, null            /html/body/table/tr[6]/td[2]    ?
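The failure shown on this slide can be reproduced with a small sketch. The two pages below are hypothetical stand-ins: the row labels are invented and only the values come from the slide. Python's `xml.etree.ElementTree` is used here simply because it supports the positional subset of XPath that the rules need. The second page lacks one row, so every later row shifts up by one position and a purely positional rule picks up the wrong cell.

```python
import xml.etree.ElementTree as ET

def make_page(rows):
    """Build a tiny well-formed page holding one two-column table."""
    body = "".join(
        f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows
    )
    return ET.fromstring(f"<html><body><table>{body}</table></body></html>")

def extract(page, row):
    """Apply /html/body/table/tr[row]/td[2], written relative to <html>."""
    cell = page.find(f"./body/table/tr[{row}]/td[2]")
    return None if cell is None else cell.text

# Row labels are hypothetical; the values are the ones shown on the slide.
page_a = make_page([
    ("Field A", "1,132,228"), ("Field B", "$20.66"), ("Field C", "$11.70"),
    ("Field D", "$20.72"), ("Field E", "$0.02"), ("Field F", "4,732,600"),
])
# Assumption: the second page has no "Field E" row, so "Field F" moves to row 5.
page_b = make_page([
    ("Field A", "1,735,857"), ("Field B", "$414.58"), ("Field C", "$247.30"),
    ("Field D", "$414.06"), ("Field F", "99,494,200"),
])

# Print what each positional rule extracts from the two pages.
for row in range(1, 7):
    print(row, extract(page_a, row), extract(page_b, row))
```

Rows 1 to 4 line up across the two pages, while rule 5 mixes a dollar amount with a large count and rule 6 returns nothing for the second page.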
11
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
12
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Matching threshold: t=0.5
13
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Matching threshold: t=0.5
Score shown: 1.0
14
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Matching threshold: t=0.5
15
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 3: (stock) AA GO MS | (min) 4 25 10 | (price) 6 26 12
Matching threshold: t=0.5
Scores shown: 0.6, 1.0
16
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 3: (stock) AA GO MS | (min) 4 25 10 | (price) 6 26 12
Matching threshold: t=0.5
Scores shown: ?, 1.0
17
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 3: (stock) AA GO MS | (min) 4 25 10 | (price) 6 26 12
Matching threshold: t=0.5
Score shown: 1.0
18
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 3: (stock) AA GO MS | (min) 4 25 10 | (price) 6 26 12
Matching thresholds shown: t=0.5, t=0.7
Score shown: 1.0
19
Data Integration
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 3: (stock) AA GO MS | (min) 4 25 10 | (price) 6 26 12
Matching thresholds shown: t=0.5, t=0.7
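The Data Integration slides compare attribute columns from different sources and keep a match only when its score reaches a threshold t. The slides do not say how the score is computed, so the sketch below swaps in a hypothetical measure (one minus the mean relative difference of values aligned on the same stock). The function names and the measure itself are assumptions made for illustration, not the method used by Flint.

```python
def similarity(xs, ys):
    """Hypothetical score in [0, 1]: mean of (1 - relative difference)."""
    scores = []
    for x, y in zip(xs, ys):
        biggest = max(abs(x), abs(y))
        if biggest == 0:
            scores.append(1.0)  # both values are zero, so they agree
        else:
            scores.append(max(0.0, 1.0 - abs(x - y) / biggest))
    return sum(scores) / len(scores)

def matches(xs, ys, t):
    """Accept the pair of columns when the score reaches the threshold t."""
    return similarity(xs, ys) >= t

# Columns from the slides, aligned on the stocks AA, GO, MS.
col_min_1 = [4, 25, 10]
col_max_1 = [10, 33, 16]
col_min_3 = [4, 25, 10]

# Compare a correct pair (min/min) and an incorrect pair (max/min).
for t in (0.5, 0.7):
    print(t, matches(col_min_1, col_min_3, t), matches(col_max_1, col_min_3, t))
```

Under this assumed measure the min/min pair scores 1.0 and the max/min pair about 0.59, so the incorrect pair is accepted at t=0.5 and rejected at t=0.7.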
20
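Wrapper Refinement
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 3: (stock) AA GO MS | (min) 4 25 10 | (price) 6 26 12
Matching thresholds shown: t=0.5, t=0.7
Weakly matching column: (min/max) 10 null 10
Scores shown: ??, 0.3 (weak), 0.0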
21
Wrapper Refinement
matching value nearby template tokens
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
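A rule anchored on a template token, such as `//td[contains(text(),'Open')]/../td[2]`, keeps working when rows are added or removed. The sketch below is a hand-written approximation: `xml.etree.ElementTree` does not implement `contains()`, so the helper `extract_by_label` (a hypothetical name) walks the rows itself and returns only the first match. Both pages and their values are made up for illustration.

```python
import xml.etree.ElementTree as ET

def extract_by_label(page, label, column=2):
    """Emulate //td[contains(text(), label)]/../td[column], first match only."""
    for row in page.iter("tr"):
        cells = row.findall("td")
        if any(label in (cell.text or "") for cell in cells):
            # This row holds the template token; read the requested sibling cell.
            return cells[column - 1].text if len(cells) >= column else None
    return None

# Hypothetical pages: the second one has no "High" row.
page_a = ET.fromstring(
    "<html><body><table>"
    "<tr><td>Open</td><td>20.66</td></tr>"
    "<tr><td>High</td><td>20.72</td></tr>"
    "<tr><td>Volume</td><td>4,732,600</td></tr>"
    "</table></body></html>"
)
page_b = ET.fromstring(
    "<html><body><table>"
    "<tr><td>Open</td><td>414.58</td></tr>"
    "<tr><td>Volume</td><td>99,494,200</td></tr>"
    "</table></body></html>"
)

volumes = [extract_by_label(page, "Volume") for page in (page_a, page_b)]
highs = [extract_by_label(page, "High") for page in (page_a, page_b)]
print(volumes, highs)
```

The volume is found on both pages even though its row position differs, and the missing "High" row yields an explicit null instead of a value from the wrong row.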
22
Wrapper Refinement
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 3: (stock) AA GO MS | (min) 4 25 10 | (price) 6 26 12
Matching thresholds shown: t=0.5, t=0.7
Weakly matching column: (min/max) 10 null 10
Refined columns: (max) 10 33 16 | (min) 4 25 10
Score shown: 1.0
//td[contains(text(),'Max')]/../td[2]
//td[contains(text(),'Min')]/../td[2]
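One way to read the refinement step: when a column matches only weakly, alternative rules are tried and the one whose extracted values agree best with an already integrated column is kept. The sketch below illustrates that selection under stated assumptions. The `similarity` measure is hypothetical (the slides do not define it), missing values are scored as 0, and the values per rule are typed in by hand instead of being extracted from a real page.

```python
def similarity(xs, ys):
    """Hypothetical score in [0, 1]; a missing value (None) scores 0."""
    scores = []
    for x, y in zip(xs, ys):
        if x is None or y is None:
            scores.append(0.0)
            continue
        biggest = max(abs(x), abs(y))
        if biggest == 0:
            scores.append(1.0)
        else:
            scores.append(max(0.0, 1.0 - abs(x - y) / biggest))
    return sum(scores) / len(scores)

def best_rule(candidates, reference):
    """Return the (rule, score) whose column is closest to the reference."""
    scored = ((rule, similarity(values, reference))
              for rule, values in candidates.items())
    return max(scored, key=lambda pair: pair[1])

# Reference column taken from the slides: (max) for AA, GO, MS.
reference_max = [10, 33, 16]

# Columns each rule would extract; "current rule" is the weak (min/max) column.
candidates = {
    "current rule": [10, None, 10],
    "//td[contains(text(),'Max')]/../td[2]": [10, 33, 16],
    "//td[contains(text(),'Min')]/../td[2]": [4, 25, 10],
}

rule, score = best_rule(candidates, reference_max)
print(rule, score)
```

With these inputs the rule anchored on 'Max' reproduces the reference column exactly and is selected over the current rule.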
23
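Wrapper Refinement
Source 1: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 2: (stock) AA GO MS | (min) 4 25 10 | (max) 10 33 16
Source 3: (stock) AA GO MS | (min) 4 25 10 | (price) 6 26 12
Matching thresholds shown: t=0.5, t=0.7
Weakly matching column: (min/max) 10 null 10
Refined columns: (max) 10 33 16 | (min) 4 25 10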
24
Experimental Results (100 websites for each domain)

Soccer domain (45,714 pages)
Attribute       |m|
Name            90
Birth Date      61
Height          54
Nationality     48
Club            43
Position        43
Weight          34
League          14

Videogame domain (49,262 pages)
Attribute       |m|
Title           86
Publisher       59
Developer       45
Genre           28
ESRB rating     40
Release Date    9
Platform        9
# Players       6

Finance domain (57,623 pages)
Attribute       |m|
Stock Symbol    84
Price Change    73
% Change        73
Volume          52
Day Low         43
Day High        41
Last Price      29
Open Price      24
25
Demo: Found Websites, Integrated Data
26
the end! http://flint.dia.uniroma3.it
27
License
This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
28
Flint System architecture
[Figure labels: Web search, Extraction, Integration, Probability, The Web]
29
Flint goal
Apple price? … 20.64, 20.49, 20.88
[Plot of P(v) against v; v-axis ticks run from 20.52 to 20.62 in steps of 0.01]
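The P(v) plot suggests that conflicting values reported by different sources are turned into a probability distribution. The slide does not say how, so the sketch below uses a deliberately simple stand-in: each observation carries a hypothetical source weight, and P(v) is the normalized total weight of the sources reporting v. The weights and the function name `value_distribution` are assumptions for illustration only.

```python
from collections import defaultdict

def value_distribution(observations):
    """Map each reported value to its share of the total source weight."""
    totals = defaultdict(float)
    for value, weight in observations:
        totals[value] += weight
    norm = sum(totals.values())
    return {value: weight / norm for value, weight in totals.items()}

# Values are kept as strings to avoid float keys; weights are hypothetical.
observations = [
    ("20.64", 2.0),  # a source assumed to be more reliable
    ("20.64", 1.0),
    ("20.49", 0.5),
    ("20.88", 0.5),
]

distribution = value_distribution(observations)
for value, probability in sorted(distribution.items()):
    print(value, probability)
```

Any weighting scheme could be plugged in here; the only property the sketch relies on is that the probabilities sum to 1.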