Toward Large Scale Integration
Building a MetaQuerier over Databases on the Web
14/05/2019 Sagar Khushalani
What is the 'Deep Web'?
The Deep Web refers to World Wide Web content that is not part of the surface Web, i.e. not part of the web that is indexed by standard search engines.
What is the 'Deep Web'?
Terabytes of content are hidden behind database query interfaces, invisible to crawlers and therefore invisible to users.
Two difficulties: finding the right database(s), and querying the database(s).
Overview
MetaQuerier: goals, features, and challenges
System architecture: database crawler, interface extraction
Schema matching (SM): introduction to DCM, data preparation, complex matching, correlation measure
Putting it together: ensemble and feedback
Conclusion
Goals & Challenges
Goals: make the "deep web" accessible (find the right sources) and usable (query them).
Challenges: what are the query capabilities of a source, and how can queries be mediated across sources?
Helping factors: regularity across sources in the same domain, and new sources being influenced by existing sources.
MetaQuerier: System Architecture
[Figure: the MetaQuerier system architecture]
Database Crawler
Functionality: automatically find Web databases by identifying their query interfaces.
Insight: query interfaces usually sit close to the root (home page) of a site, so only shallow crawling is needed.
Approach: a Site Collector gathers candidate sites, and a Shallow Crawler explores each site only to a small link depth.
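The crawler's page-level test can be sketched as a simple form detector. This is a minimal illustration using Python's standard `html.parser`, not the paper's actual heuristics:

```python
from html.parser import HTMLParser

class FormDetector(HTMLParser):
    """Flags pages that contain a form with at least one visible input
    field, a rough signal that the page hosts a query interface."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_query_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in ("input", "select", "textarea"):
            if dict(attrs).get("type") not in ("hidden", "submit"):
                self.has_query_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def looks_like_query_interface(html):
    detector = FormDetector()
    detector.feed(html)
    return detector.has_query_form
```

A shallow crawler would apply this test only to pages within a small link depth of a site's root page, rather than crawling the whole site.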
Interface Extraction
Functionality: extract the query template of a given interface, modelling each query condition as a 3-tuple.
Insight: HTML query forms follow a certain hidden syntax, which turns extraction into a parsing problem over parse trees.
Approach: the authors treat the form layout as a 'visual language' and parse it.
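An extracted template can be represented as a set of condition 3-tuples. A minimal sketch, with illustrative names rather than the paper's exact schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    """One query condition of an interface, as a 3-tuple."""
    attribute: str   # e.g. "title"
    operator: str    # e.g. "contains", "equals", "<="
    value: str       # the value domain, e.g. "any" for free text

# A query interface's template is then a set of condition tuples.
book_search = {
    Condition("title", "contains", "any"),
    Condition("price", "<=", "numeric"),
}
```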
Schema Matching: Functionality
Attributes from different interfaces must be analysed, and equivalent attributes combined, to enable query mediation.
Finding equivalent attributes or attribute groups is the schema matching problem.
Schema Matching: Procedure
[Figure: the schema matching procedure]
Schema Matching: Approach
Simple matching vs. complex matching: most existing schema matching methods compare only two schemas at a time and cannot handle attribute groups, e.g. {passengers} = {adults, seniors, children, infants}.
Complex matching exploits context information across many schemas, using grouping attributes and synonym attributes.
Motivating Example
User Amy wants to buy two tickets to fly from Anchorage to Baltimore: one for herself, one for her child.
Website 1 attributes: From, To, Number of passengers.
Website 2 attributes: Origin, Destination, Adults, Children, Infants, Seniors.
Example: Complex Matching
E.g. {adults, seniors, children} = {number of tickets}, and {from} = {leaving from}.
Grouping attributes: adults, seniors, and children group together within one interface.
Synonym attributes: the group {adults, seniors, children} is synonymous with {number of tickets} across interfaces.
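Once a synonym matching is known, a mediator can translate a query between the two forms. A toy sketch for the passenger-count case, an illustrative assumption rather than the paper's mediation algorithm:

```python
def translate_counts(query, group, target):
    """Rewrite per-field counts on the grouped attributes into a
    single total on the synonym attribute."""
    return {target: sum(query.get(attr, 0) for attr in group)}

# Amy's query on Website 2's schema, rewritten for Website 1's schema:
amy = {"adults": 1, "children": 1}
translate_counts(amy, {"adults", "seniors", "children"}, "number of tickets")
# {"number of tickets": 2}
```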
Data Preparation
HTML forms as published are not directly "minable"; the data must first be prepared.
Pre-processing steps: form extraction, attribute type recognition, and syntactic merging.
Data Preparation: Form Extraction
Reads a web page containing query forms and extracts the attributes, e.g. "Title of Book:" becomes <name = "title of book", domain = any>.
Method: parsing.
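A stripped-down extractor can at least pull the field names out of a form; a full extractor must also associate each field with its visible label text and value domain. A sketch using the standard library:

```python
from html.parser import HTMLParser

class FieldNameExtractor(HTMLParser):
    """Collects the name attribute of each visible form field."""
    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            field = dict(attrs)
            if field.get("type") not in ("hidden", "submit") and field.get("name"):
                self.names.append(field["name"])

def extract_field_names(html):
    parser = FieldNameExtractor()
    parser.feed(html)
    return parser.names
```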
Data Preparation: Type Recognition
Ambiguity: "departing" can mean the city of departure or the time of departure.
Entities are distinguished by both attribute name and type, avoiding confusion caused by such homonyms; a type recognizer infers each attribute's type from its domain values.
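Type recognition can be sketched as guessing a type from sample domain values. A simplified stand-in for the paper's type recognizer, covering only three illustrative types:

```python
import re
from datetime import datetime

def is_date(value):
    """True if the value parses as a MM/DD/YYYY date."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except ValueError:
        return False

def recognize_type(values):
    """Guess an attribute's type from its observed domain values."""
    if values and all(re.fullmatch(r"\d+", v) for v in values):
        return "integer"
    if values and all(is_date(v) for v in values):
        return "date"
    return "string"
```

On a "departing" field, date-like values would mark it as a departure time, while city names would mark it as a string-typed departure city.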
Data Preparation: Syntactic Merging
Name-based merging: merge two attributes with similar names, but only when A is a variation of B (e.g. "title of book" vs. "title") and B is used more frequently than A.
Domain-based merging: merge attributes with similar domain values; only string-type attributes are considered.
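Name-based merging can be sketched over observed attribute frequencies. A minimal sketch, assuming "variation" means one name extends the other by extra words:

```python
def name_based_merge(attr_counts):
    """Fold attribute A into B when A's name is a word-level variation
    of B's (e.g. 'title of book' extends 'title') and B is the more
    frequent form across interfaces."""
    merged = dict(attr_counts)
    # consider longer (more specific) names first
    for a in sorted(attr_counts, key=len, reverse=True):
        for b in attr_counts:
            if (a != b and a in merged and b in merged
                    and (a.startswith(b + " ") or a.endswith(" " + b))
                    and merged[b] > merged[a]):
                merged[b] += merged.pop(a)
                break
    return merged

name_based_merge({"title": 10, "title of book": 3})
# {"title": 13}
```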
Correlation Measure
What is correlation? A score computed from the contingency table of two attributes over the observed schemas.
Two kinds of correlation are measured: positive (mp) and negative (mn).
Correlation Measure
Contingency table for Ap, Aq: f11 (both present), f10 (only Ap), f01 (only Aq), f00 (neither); f1+ = f11 + f10, f+1 = f11 + f01, and f++ is the total number of schemas.
H(Ap, Aq) = (f01 · f10) / (f+1 · f1+)
Negative correlation: mn = H(Ap, Aq).
Positive correlation: mp = 1 − H(Ap, Aq) if f11/f++ ≥ Td, else mp = 0; the threshold Td filters out attribute pairs that co-occur too rarely to trust.
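These measures can be computed directly from schemas represented as attribute sets. A minimal sketch; Td = 0.1 is an illustrative threshold, not the paper's tuned value:

```python
def contingency(schemas, ap, aq):
    """2x2 co-occurrence counts of attributes ap, aq over the schemas."""
    f11 = sum(1 for s in schemas if ap in s and aq in s)
    f10 = sum(1 for s in schemas if ap in s and aq not in s)
    f01 = sum(1 for s in schemas if aq in s and ap not in s)
    return f11, f10, f01

def h_measure(schemas, ap, aq):
    """H = f01*f10 / (f+1 * f1+): near 1 when the attributes rarely
    co-occur, near 0 when they usually appear together."""
    f11, f10, f01 = contingency(schemas, ap, aq)
    denom = (f11 + f01) * (f11 + f10)   # f+1 * f1+
    return (f01 * f10 / denom) if denom else 0.0

def m_negative(schemas, ap, aq):
    return h_measure(schemas, ap, aq)

def m_positive(schemas, ap, aq, td=0.1):
    f11, _, _ = contingency(schemas, ap, aq)
    if f11 / len(schemas) < td:         # f11/f++: too rare to trust
        return 0.0
    return 1.0 - h_measure(schemas, ap, aq)

schemas = [{"adults", "children"}, {"adults", "children"},
           {"passengers"}, {"passengers"}]
m_positive(schemas, "adults", "children")    # 1.0
m_negative(schemas, "adults", "passengers")  # 1.0
```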
Matching Discovery, Step I: Group Discovery
Positively correlated attributes form candidate groups.
A group is useful for matching only if it is negatively correlated with other groups; e.g. if every source contains {last name, first name}, that group needs no matching.
Matching Discovery, Step II: Complex Matching Discovery
Matching discovery operates on the attribute groups found in Step I; synonym groups should be negatively correlated with one another.
Matching: Ranking
Each candidate matching is scored so that matchings can be compared.
The score of a matching Mj is the maximal negative correlation between any pair of its attribute groups:
Cmax(Mj, mn) = max mn(Gjr, Gjt) over all groups Gjr, Gjt in Mj with r ≠ t.
This score, combined with semantic subsumption, yields the ranking.
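The Cmax score is a straightforward maximum over group pairs. A sketch that takes the pairwise mn scores as a precomputed table:

```python
from itertools import combinations

def cmax(matching, mn):
    """Rank of a matching = the maximal negative correlation over all
    pairs of its attribute groups; `mn` maps an unordered group pair
    to a precomputed m_n score."""
    return max(mn[frozenset((g, h))] for g, h in combinations(matching, 2))

g1 = frozenset({"adults", "seniors", "children"})
g2 = frozenset({"number of tickets"})
cmax([g1, g2], {frozenset((g1, g2)): 0.92})  # 0.92
```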
Matching Selection, Step III: Complex Matching Selection
Complex matching discovery can produce false matchings, e.g. both {author} = {first name, last name} and {subject} = {first name, last name}.
These matchings must conflict, since "author" and "subject" are not synonyms; the solution is to remove conflicts based on negative correlation.
Final Algorithm
In each iteration, find the highest-ranked matching Mt and add it to the result set.
Remove the remaining candidate matchings that are inconsistent with Mt.
Repeat until no candidates remain, then return the final set of matchings.
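The iteration above is a greedy selection. A minimal sketch, modelling a matching as a set of attribute groups and treating two matchings as conflicting when they share a group:

```python
def select_matchings(candidates, score, conflicts):
    """Greedy selection: repeatedly keep the highest-ranked matching
    and drop remaining candidates that are inconsistent with it."""
    chosen, remaining = [], list(candidates)
    while remaining:
        best = max(remaining, key=score)
        chosen.append(best)
        remaining = [m for m in remaining
                     if m != best and not conflicts(m, best)]
    return chosen

m1 = frozenset({frozenset({"author"}),
                frozenset({"first name", "last name"})})
m2 = frozenset({frozenset({"subject"}),
                frozenset({"first name", "last name"})})
scores = {m1: 0.9, m2: 0.4}
result = select_matchings([m1, m2], scores.get,
                          lambda a, b: bool(a & b))
# result == [m1]; m2 is dropped because it shares a group with m1
```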
Putting It Together
As shown above, MetaQuerier consists of several subsystems that were developed concurrently, which raises some questions:
What level of accuracy is good enough for each subsystem, and can a subsystem be made more accurate?
Since each subsystem depends on the previous ones, is one subsystem's accuracy sufficient for the next subsystem to do its job?
Putting It Together: Observations
Connecting the subsystems may demand higher accuracy from each individual subsystem, and information from later subsystems can be fed back to earlier ones to increase accuracy.
Solutions: to sustain the accuracy of schema matching (SM), use an ensemble; to improve the accuracy of interface extraction (IE), use feedback.
Ensemble
Procedure: execute the matcher on a smaller random sample of the input schemas; each sampled set is called a trial.
Run multiple matchers, each over an independent trial, and aggregate their results.
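The trial-and-aggregate loop can be sketched as follows; majority voting stands in here for the paper's aggregation step, and the trial count and sample size are illustrative:

```python
import random

def ensemble_match(schemas, matcher, trials=11, sample_size=None, seed=0):
    """Run the base matcher on several independent random samples
    ('trials') of the input schemas, keeping only matchings found in a
    majority of trials; sampling damps the effect of noisy schemas."""
    rng = random.Random(seed)
    size = sample_size or max(1, len(schemas) // 2)
    votes = {}
    for _ in range(trials):
        trial = rng.sample(schemas, size)
        for m in matcher(trial):
            votes[m] = votes.get(m, 0) + 1
    return [m for m, v in votes.items() if v > trials // 2]
```

A matching found only in a few trials (e.g. one caused by a rare, noisy schema) fails the majority vote and is discarded.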
Feedback: Domain Statistics
Observation: IE can take feedback from SM to improve its accuracy.
The information passed back, obtained by voting among similar query interfaces, is called a domain statistic. Three types are used: attribute types, attribute frequencies, and attribute correlations.
Feedback: Attribute Types
Commonly occurring attributes have a known type that can guide parsing, e.g. ISBN is a numeric attribute.
Feedback: Attribute Frequency
Certain attributes are very common within a domain, e.g. "Last Name".
"Last Name" is a far more common query-interface attribute than sample text such as "e.g. Mike", which tells the extractor that "Last Name" is probably the right label for the text box.
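Frequency feedback amounts to preferring the candidate label with the higher domain frequency. A one-function sketch; the frequency table and its counts are illustrative:

```python
def pick_label(candidates, domain_freq):
    """Among candidate labels extracted for one form field, prefer the
    one that is a frequent attribute in the domain statistics."""
    return max(candidates, key=lambda c: domain_freq.get(c.lower(), 0))

pick_label(["Last Name", "e.g. Mike"],
           {"last name": 87, "first name": 85})
# "Last Name"
```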
Feedback: Correlation
Suppose "adults" and "children" are positively correlated, and both are negatively correlated with "passengers".
If "children" has been identified as an attribute with certainty, the extractor will prefer "adults" as the other attribute and ignore "passengers".
Feedback: Combining Domain Statistics
How are the three rules (type, frequency, correlation) combined? The authors apply them in a fixed order of precedence.
Review
What is MetaQuerier? System architecture: database crawler, interface extraction
Schema matching: data preparation, correlation measure, matching discovery and selection
Putting it together: issues and solutions
Questions?
Sources: attributed multi-set grammar; the deep Web; papers by the authors