Toward Large Scale Integration

Building a MetaQuerier over Databases on the Web

What is 'Deep Web'? Deep Web refers to World Wide Web content that is NOT part of the surface Web (i.e., not part of the web that is indexed by standard search engines).


What is 'Deep Web'?
- Terabytes of content hidden behind database interfaces
- Invisible to crawlers, and therefore invisible to users
Difficulties:
- Find the right database(s)
- Query the database(s)

5 Overview
- MetaQuerier
  - Goals, Features and Challenges
  - System Architecture
    - Database Crawler
    - Interface Extraction
    - Schema Matching (SM)
- Introduction to DCM
  - Data Preparation
  - Complex Matching
  - Correlation Measure
- Putting it together
  - Ensemble
  - Feedback
- Conclusion

6 Goals & Challenges
Goals:
- Make the “deep web” accessible (find)
- Make it usable (query)
Challenges:
- What are the query capabilities of a source?
- How to mediate queries?
Factors that help:
- Regularity across sources in the same domain
- New sources are influenced by existing sources

7 MetaQuerier: System Architecture
14/05/2019 Sagar Khushalani

8 MetaQuerier: System Architecture

9 Database Crawler
Functionality: Automatically find databases by identifying query interfaces
Insight: Query interfaces are often found near the root page of a web site
Approach:
- Site Collector
- Shallow Crawler
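The shallow-crawling idea can be sketched as a depth-limited breadth-first search from a site's root page. This is only an illustration, not the authors' crawler: the `links` dictionary stands in for fetching and parsing real pages, `has_query_form` is a hypothetical detector, and the depth limit is an assumed parameter.

```python
from collections import deque

def shallow_crawl(root, links, has_query_form, max_depth=3):
    """Breadth-first crawl from a site's root page, stopping at max_depth.

    links: dict mapping a page to the pages it links to (hypothetical
           stand-in for downloading and parsing real pages).
    has_query_form: predicate deciding whether a page holds a query form.
    """
    found, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        if has_query_form(page):
            found.append(page)
        if depth == max_depth:
            continue  # do not follow links below the depth limit
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return found

# Toy site: two forms near the root, one buried deep in the site.
site = {
    "/": ["/about", "/search"],
    "/about": ["/team"],
    "/search": ["/advanced"],
    "/team": ["/deep/archive"],
    "/deep/archive": ["/deep/archive/form"],
}
forms = {"/search", "/advanced", "/deep/archive/form"}
print(shallow_crawl("/", site, forms.__contains__, max_depth=2))
```

With a depth limit of 2 the crawl reports the two shallow forms and never reaches the buried one, which is the trade-off the insight on this slide accepts.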

10 MetaQuerier: System Architecture

11 Interface Extraction
Functionality: Extract the query templates of a given interface as a 3-tuple
Insight: HTML query forms follow a certain hidden syntax. This turns extraction into a parsing problem that can be solved with parse trees.
Approach: The authors use 'visual language' parsing

12 MetaQuerier: System Architecture

13 Schema Matching: Functionality
- Attributes from different interfaces need to be analysed
- Equivalent attributes need to be combined (query mediation)
- To find equivalent attributes or attribute groups, we need schema matching

14 Schema Matching: Procedure

15 Schema Matching: Approach
Simple matching vs. complex matching:
- Current schema matching methods compare only two schemas at a time and cannot work with groups of attributes
- E.g.: {passengers} = {adults, seniors, children, infants}
- Complex matching allows context information to be used
Two kinds of relationship: Grouping Attributes & Synonym Attributes

Motivation: Example
- User Amy wants to buy 2 tickets to fly from Anchorage to Baltimore: one for herself, one for her child
- Website 1 attributes: From, To, Number of passengers
- Website 2 attributes: Origin, Destination, Adults, Children, Infants, Seniors

17 Example: Complex Matching
E.g.:
- {adults, seniors, children} = {number of tickets}
- {from} = {leaving from}
Grouping: {adults, children, seniors}
Synonym: {number of tickets} = {adults, seniors, children}

18 Data Preparation
HTML forms are not “minable”
Pre-processing: The data needs to be "prepared"
- Extracting the form
- Attribute type recognition
- Syntactic merging

19 Data Preparation: Form Extraction
Reads a web page with query forms and extracts the attribute names:
- Title of Book: <name = "title of book", domain = any>
Method: Parsing
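As a rough illustration of turning a form into <name, domain> pairs, the sketch below walks the HTML with Python's standard `html.parser`. It is a simplification and not the authors' parser: it reads the `name` attribute of each field rather than the visible label, and the class name `FormAttributeParser` and the sample form are made up for the example.

```python
from html.parser import HTMLParser

class FormAttributeParser(HTMLParser):
    """Collect (name, domain) pairs from <input> and <select> elements."""

    def __init__(self):
        super().__init__()
        self.attributes = []     # extracted (name, domain) pairs
        self._select = None      # name of the <select> currently open
        self._options = []
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type", "text") == "text":
            # A free-text box accepts any value.
            self.attributes.append((attrs.get("name", ""), "any"))
        elif tag == "select":
            self._select, self._options = attrs.get("name", ""), []
        elif tag == "option" and self._select is not None:
            self._in_option = True

    def handle_data(self, data):
        if self._in_option and data.strip():
            self._options.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False
        elif tag == "select" and self._select is not None:
            # A selection list has an enumerated domain.
            self.attributes.append((self._select, tuple(self._options)))
            self._select = None

page = '''
<form>
  Title of Book: <input type="text" name="title of book">
  Format: <select name="format">
    <option>hardcover</option><option>paperback</option>
  </select>
</form>
'''
parser = FormAttributeParser()
parser.feed(page)
print(parser.attributes)
```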

20 Data Preparation: Type Recognition
Confusion: "departing" can mean:
- City of departure
- Time of departure
Entities are distinguished by both attribute name and type, which avoids the confusion caused by homonyms
Component: Type Recognizer
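A minimal sketch of how a name plus a recognized type can separate homonyms. The type labels and the regular expressions are assumptions made for illustration; the slides do not list the types the actual Type Recognizer supports.

```python
import re

def recognize_type(values):
    """Guess a coarse type from an attribute's sample domain values."""
    if not values:
        return "any"
    if all(re.fullmatch(r"\d+", v) for v in values):
        return "integer"
    if all(re.fullmatch(r"\d{1,2}:\d{2}", v) for v in values):
        return "time"
    return "string"

def entity(name, values):
    """An entity is identified by its name together with its type."""
    return (name.lower(), recognize_type(values))

dep_city = entity("departing", ["Anchorage", "Baltimore"])
dep_time = entity("departing", ["08:30", "14:05"])
print(dep_city, dep_time)
```

Both attributes are labelled "departing", but they become different entities because their types differ.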

21 Data Preparation: Syntactic Merging
Name-based merging: Merge two attributes if their names are similar
- Only merge A into B when A is a variation of B (e.g. "title" and "title of book") and B is more frequently used than A
Domain-based merging: Merge if the domain values are similar
- Only consider string-type attributes
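The name-based rule can be sketched as follows. The `is_variation` test (one name's words are contained in the other's) is a hypothetical stand-in for whatever similarity test the authors use, and the frequencies are invented.

```python
def is_variation(a, b):
    """Hypothetical test: one name's words are a subset of the other's."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return a != b and (wa <= wb or wb <= wa)

def name_based_merge(freq):
    """Map each attribute name to the name it is merged into.

    freq: dict of attribute name -> number of interfaces using it.
    A name is merged only into a variation that is used more often.
    """
    merged = {}
    for a in freq:
        candidates = [b for b in freq
                      if is_variation(a, b) and freq[b] > freq[a]]
        merged[a] = max(candidates, key=freq.get) if candidates else a
    return merged

freq = {"title": 40, "title of book": 5, "author": 30, "book author": 2}
print(name_based_merge(freq))
```

Here the rarer "title of book" folds into "title", while "title" itself stays put because no variation of it is more frequent.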

22 Correlation Measure
What is correlation?
- A test score computed from the contingency table of two attributes
Two types of correlation:
- Positive ($m_p$)
- Negative ($m_n$)

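23 Correlation Measure
Fig: Contingency table for $A_p$, $A_q$
Notation: $f_{11}$ counts the interfaces containing both attributes, $f_{10}$ and $f_{01}$ those containing only $A_p$ or only $A_q$, and $f_{1+}$, $f_{+1}$, $f_{++}$ are the row, column and overall totals
$H(A_p, A_q) = \frac{f_{01} f_{10}}{f_{+1} f_{1+}}$
Negative correlation: $m_n = H(A_p, A_q)$
Positive correlation:
- If $f_{11} / f_{++} < T_d$: $m_p = 1 - H(A_p, A_q)$
- Else: $m_p = 0$
$T_d$ is a threshold parameter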
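To make the H formula concrete, the sketch below builds the contingency counts from a list of schemas (each a set of attribute names) and evaluates H = f01·f10 / (f+1·f1+). The schemas are invented, and the threshold Td is left out on purpose; only H, and therefore the negative measure mn, is computed.

```python
def contingency(schemas, ap, aq):
    """Count how often attributes ap and aq occur together or apart."""
    f11 = sum(1 for s in schemas if ap in s and aq in s)
    f10 = sum(1 for s in schemas if ap in s and aq not in s)
    f01 = sum(1 for s in schemas if ap not in s and aq in s)
    f00 = len(schemas) - f11 - f10 - f01
    return f11, f10, f01, f00

def h_measure(schemas, ap, aq):
    """H(ap, aq) = f01 * f10 / (f+1 * f1+)."""
    f11, f10, f01, _ = contingency(schemas, ap, aq)
    f1_plus = f11 + f10   # schemas containing ap
    f_plus1 = f11 + f01   # schemas containing aq
    if f1_plus == 0 or f_plus1 == 0:
        raise ValueError("both attributes must occur at least once")
    return (f01 * f10) / (f_plus1 * f1_plus)

schemas = [
    {"from", "to", "adults", "children"},
    {"from", "to", "adults", "children"},
    {"origin", "destination", "passengers"},
    {"from", "to", "passengers"},
]
print(h_measure(schemas, "adults", "passengers"))  # never co-occur
print(h_measure(schemas, "adults", "children"))    # always co-occur
```

Attributes that never appear together score 1, the strongest negative correlation, while attributes that always appear together score 0.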

24 Matching: Discovery
Step I: Group Discovery
- Positively correlated attributes can form potential groups
- However, if a group has no negative correlation with other groups, it is of no use for matching
- E.g.: {lastname, firstname}: if all sources have this group, there is no need for matching

25 Matching: Discovery
Step II: Complex Matching Discovery
- Matching discovery works on attribute groups
- Negative correlation should exist between synonym groups

- Each matching needs to be ranked, so that matchings can be compared
- The rank of a matching is the maximal negative correlation between any pair of attribute groups in the matching:
  $C_{max}(M_j, m_n) = \max_{r \neq t} m_n(G_{jr}, G_{jt})$ over all groups $G_{jr}$, $G_{jt}$ in $M_j$
- The score, combined with semantic subsumption, allows matchings to be ranked
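The ranking score can be sketched directly from the Cmax definition. The pairwise scores below are invented, and `m_n` is passed in as a function because the slides do not say how the negative measure is lifted from attributes to groups. Ties would be broken by semantic subsumption, which this sketch does not implement.

```python
from itertools import combinations

def c_max(matching, m_n):
    """Largest negative correlation over all pairs of groups in a matching.

    m_n is assumed symmetric, so each unordered pair is scored once.
    """
    return max(m_n(g_r, g_t) for g_r, g_t in combinations(matching, 2))

P = frozenset({"passengers"})
A = frozenset({"adults", "children"})
T = frozenset({"number of tickets"})

# Hypothetical pairwise scores between groups.
scores = {
    frozenset({P, A}): 0.9,
    frozenset({P, T}): 0.6,
    frozenset({A, T}): 0.8,
}

def m_n(g, h):
    return scores[frozenset({g, h})]

matchings = {"M1": [P, A], "M2": [P, T], "M3": [A, T]}
ranked = sorted(matchings, key=lambda k: c_max(matchings[k], m_n), reverse=True)
print(ranked)
```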

27 Matching: Selection
Step III: Complex Matching Selection
- Complex matching can create false matchings
- E.g.: {author} = {first name, last name} & {subject} = {first name, last name}
- Why? These two matchings conflict: they match the same group {first name, last name} to two different attributes
- Solution: remove conflicts based on negative correlation

Final Algorithm
- In each iteration, find the highest-ranked matching Mt and add it to the result set
- Remove the remaining matchings that are inconsistent with Mt
- Return the final set of matchings
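The greedy loop can be sketched as below. The `conflicts` test (two matchings share an attribute) is a hypothetical simplification of the consistency check, and the scores are invented.

```python
def conflicts(m1, m2):
    """Hypothetical consistency test: matchings sharing an attribute conflict."""
    attrs1 = set().union(*m1)
    attrs2 = set().union(*m2)
    return bool(attrs1 & attrs2)

def select_matchings(scored):
    """Greedy selection over a list of (matching, score) pairs."""
    remaining = list(scored)
    selected = []
    while remaining:
        # Pick the highest-ranked matching left.
        best = max(remaining, key=lambda ms: ms[1])
        selected.append(best[0])
        # Drop it, together with everything inconsistent with it.
        remaining = [ms for ms in remaining
                     if ms is not best and not conflicts(ms[0], best[0])]
    return selected

author_m = (frozenset({"author"}), frozenset({"first name", "last name"}))
subject_m = (frozenset({"subject"}), frozenset({"first name", "last name"}))
pass_m = (frozenset({"passengers"}), frozenset({"adults", "children"}))

scored = [(author_m, 0.9), (subject_m, 0.4), (pass_m, 0.7)]
print(select_matchings(scored))
```

The lower-ranked {subject} matching is discarded because it competes with the {author} matching for the same name group.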

Putting it together
- As shown before, MetaQuerier consists of various subsystems
- The subsystems were developed concurrently. However, this approach raises certain problems:
  - What level of accuracy is good enough?
  - Is it possible to make a subsystem more accurate?
  - Each subsystem depends on the previous ones: is one subsystem's accuracy enough for the next subsystem to function?

30 Putting it together
Observation:
- Joining the subsystems may require higher accuracy from each individual subsystem
- Information from later subsystems can be fed back to earlier ones to increase accuracy
Solution:
- To sustain the accuracy of SM: Ensemble
- To improve the accuracy of IE: Feedback

Ensemble
Procedure:
- Execute the matcher on a smaller, random sample of the input schemas
- Each such set of schemas is called a trial
- Run multiple matchers, where each matcher is executed over an independent trial of schemas
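The sampling procedure can be sketched as follows. The slide does not say how the results of the trials are combined, so the majority vote used here is an assumption, as are the parameter names and the toy matcher.

```python
import random
from collections import Counter

def ensemble(schemas, matcher, trials=5, sample_size=3, seed=0):
    """Run matcher on independent random trials and keep majority results.

    matcher: function taking a list of schemas and returning an iterable
             of hashable matchings (hypothetical interface).
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(trials):
        trial = rng.sample(schemas, sample_size)  # one random trial
        votes.update(set(matcher(trial)))
    # Assumed aggregation: keep matchings found in more than half the trials.
    return {m for m, n in votes.items() if n > trials / 2}

def toy_matcher(trial):
    """Toy matcher: fires when both alternatives show up in the trial."""
    has_group = any({"adults", "children"} <= s for s in trial)
    has_single = any("passengers" in s for s in trial)
    found = has_group and has_single
    return ["{passengers} = {adults, children}"] if found else []

schemas = [
    {"from", "to", "adults", "children"},
    {"origin", "destination", "passengers"},
    {"from", "to", "passengers"},
    {"from", "to", "adults", "children", "seniors"},
]
print(ensemble(schemas, toy_matcher))
```

Because each trial sees only a sample, a few noisy schemas are unlikely to appear in most trials, which is the intuition behind using an ensemble to sustain matching accuracy.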

32 Feedback: Domain Statistics
Observation: IE can take feedback from SM to improve its accuracy
- The information that is passed back is obtained by voting between similar query interfaces, and is known as a domain statistic
There are three types of domain statistics:
- Type of attributes
- Frequency of attributes
- Correlation of attributes

33 Feedback: Attribute Types
Type of attributes: Commonly occurring attributes have a type that can be used for parsing
- E.g. ISBN is a numeric attribute

34 Feedback: Attribute Frequency
Frequency of attributes: Certain attributes are very common in a particular domain, e.g. "last name"
- "Last Name" is a much more common attribute name in query interfaces than "e.g. Mike"
- This tells the IE system that "Last Name" is probably the right attribute for the text box

35 Feedback: Correlation
- Suppose "adults" and "children" have a positive correlation, and both have a negative correlation with "passengers"
- If "children" has already been identified as an attribute, then IE will choose "adults" as the other attribute, ignoring "passengers"

36 Feedback: Domain Statistics
How to combine these three rules (type, frequency, correlation)? The authors use the following strategy: > 2 -> 1

37 Review
- What is MetaQuerier?
- System Architecture
  - Database Crawler
  - Interface Extraction
  - Schema Matching
- Data Preparation
- Correlation Measure
- Matching
- Putting it together: issues and solutions

Questions?

