Toward Large Scale Integration
Building a MetaQuerier over Databases on the Web
14/05/2019 Sagar Khushalani
What is the 'Deep Web'?
The Deep Web refers to World Wide Web content that is not part of the surface Web, i.e. not part of the web that is indexed by standard search engines.
What is the 'Deep Web'?
Terabytes of content are hidden behind database query interfaces, invisible to crawlers and therefore invisible to users.
Two difficulties: finding the right database(s), and querying the database(s).
Overview
MetaQuerier: goals, features, and challenges
System architecture: database crawler, interface extraction
Schema matching (SM): introduction to DCM, data preparation, complex matching, correlation measure
Putting it together: ensemble and feedback
Conclusion
Goals & Challenges
Goals: make the "deep web" accessible (find the right sources) and usable (query them).
Challenges: what are the query capabilities of a source, and how can queries be mediated across sources?
Helping factors: regularity across sources in the same domain, and new sources being influenced by existing sources.
MetaQuerier: System Architecture
[Figure: the MetaQuerier system architecture]
Database Crawler
Functionality: automatically find Web databases by identifying their query interfaces.
Insight: query interfaces usually sit close to the root (home page) of a site, so only shallow crawling is needed.
Approach: a Site Collector gathers candidate sites, and a Shallow Crawler explores each site only to a small link depth.
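The crawler's page-level test can be sketched as a simple form detector. This is a minimal illustration using Python's standard `html.parser`, not the paper's actual heuristics:

```python
from html.parser import HTMLParser

class FormDetector(HTMLParser):
    """Flags pages that contain a form with at least one visible input
    field, a rough signal that the page hosts a query interface."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_query_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in ("input", "select", "textarea"):
            if dict(attrs).get("type") not in ("hidden", "submit"):
                self.has_query_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def looks_like_query_interface(html):
    detector = FormDetector()
    detector.feed(html)
    return detector.has_query_form
```

A shallow crawler would apply this test only to pages within a small link depth of a site's root page, rather than crawling the whole site.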
Interface Extraction
Functionality: extract the query template of a given interface, modelling each query condition as a 3-tuple.
Insight: HTML query forms follow a certain hidden syntax, which turns extraction into a parsing problem over parse trees.
Approach: the authors treat the form layout as a 'visual language' and parse it.
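An extracted template can be represented as a set of condition 3-tuples. A minimal sketch, with illustrative names rather than the paper's exact schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    """One query condition of an interface, as a 3-tuple."""
    attribute: str   # e.g. "title"
    operator: str    # e.g. "contains", "equals", "<="
    value: str       # the value domain, e.g. "any" for free text

# A query interface's template is then a set of condition tuples.
book_search = {
    Condition("title", "contains", "any"),
    Condition("price", "<=", "numeric"),
}
```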
Schema Matching: Functionality
Attributes from different interfaces must be analysed, and equivalent attributes combined, to enable query mediation.
Finding equivalent attributes or attribute groups is the schema matching problem.
Schema Matching: Procedure
[Figure: the schema matching procedure]
Schema Matching: Approach
Simple matching vs. complex matching: most existing schema matching methods compare only two schemas at a time and cannot handle attribute groups, e.g. {passengers} = {adults, seniors, children, infants}.
Complex matching exploits context information across many schemas, using grouping attributes and synonym attributes.
Motivating Example
User Amy wants to buy two tickets to fly from Anchorage to Baltimore: one for herself, one for her child.
Website 1 attributes: From, To, Number of passengers.
Website 2 attributes: Origin, Destination, Adults, Children, Infants, Seniors.
Example: Complex Matching
E.g. {adults, seniors, children} = {number of tickets}, and {from} = {leaving from}.
Grouping attributes: adults, seniors, and children group together within one interface.
Synonym attributes: the group {adults, seniors, children} is synonymous with {number of tickets} across interfaces.
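Once a synonym matching is known, a mediator can translate a query between the two forms. A toy sketch for the passenger-count case, an illustrative assumption rather than the paper's mediation algorithm:

```python
def translate_counts(query, group, target):
    """Rewrite per-field counts on the grouped attributes into a
    single total on the synonym attribute."""
    return {target: sum(query.get(attr, 0) for attr in group)}

# Amy's query on Website 2's schema, rewritten for Website 1's schema:
amy = {"adults": 1, "children": 1}
translate_counts(amy, {"adults", "seniors", "children"}, "number of tickets")
# {"number of tickets": 2}
```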
Data Preparation
HTML forms as published are not directly "minable"; the data must first be prepared.
Pre-processing steps: form extraction, attribute type recognition, and syntactic merging.
Data Preparation: Form Extraction
Reads a web page containing query forms and extracts the attributes, e.g. "Title of Book:" becomes <name = "title of book", domain = any>.
Method: parsing.
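A stripped-down extractor can at least pull the field names out of a form; a full extractor must also associate each field with its visible label text and value domain. A sketch using the standard library:

```python
from html.parser import HTMLParser

class FieldNameExtractor(HTMLParser):
    """Collects the name attribute of each visible form field."""
    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            field = dict(attrs)
            if field.get("type") not in ("hidden", "submit") and field.get("name"):
                self.names.append(field["name"])

def extract_field_names(html):
    parser = FieldNameExtractor()
    parser.feed(html)
    return parser.names
```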
Data Preparation: Type Recognition
Ambiguity: "departing" can mean the city of departure or the time of departure.
Entities are distinguished by both attribute name and type, avoiding confusion caused by such homonyms; a type recognizer infers each attribute's type from its domain values.
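Type recognition can be sketched as guessing a type from sample domain values. A simplified stand-in for the paper's type recognizer, covering only three illustrative types:

```python
import re
from datetime import datetime

def is_date(value):
    """True if the value parses as a MM/DD/YYYY date."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except ValueError:
        return False

def recognize_type(values):
    """Guess an attribute's type from its observed domain values."""
    if values and all(re.fullmatch(r"\d+", v) for v in values):
        return "integer"
    if values and all(is_date(v) for v in values):
        return "date"
    return "string"
```

On a "departing" field, date-like values would mark it as a departure time, while city names would mark it as a string-typed departure city.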
Data Preparation: Syntactic Merging
Name-based merging: merge two attributes with similar names, but only when A is a variation of B (e.g. "title of book" vs. "title") and B is used more frequently than A.
Domain-based merging: merge attributes with similar domain values; only string-type attributes are considered.
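Name-based merging can be sketched over observed attribute frequencies. A minimal sketch, assuming "variation" means one name extends the other by extra words:

```python
def name_based_merge(attr_counts):
    """Fold attribute A into B when A's name is a word-level variation
    of B's (e.g. 'title of book' extends 'title') and B is the more
    frequent form across interfaces."""
    merged = dict(attr_counts)
    # consider longer (more specific) names first
    for a in sorted(attr_counts, key=len, reverse=True):
        for b in attr_counts:
            if (a != b and a in merged and b in merged
                    and (a.startswith(b + " ") or a.endswith(" " + b))
                    and merged[b] > merged[a]):
                merged[b] += merged.pop(a)
                break
    return merged

name_based_merge({"title": 10, "title of book": 3})
# {"title": 13}
```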
Correlation Measure
What is correlation? A score computed from the contingency table of two attributes over the observed schemas.
Two kinds of correlation are measured: positive (mp) and negative (mn).
Correlation Measure
Contingency table for Ap, Aq: f11 (both present), f10 (only Ap), f01 (only Aq), f00 (neither); f1+ = f11 + f10, f+1 = f11 + f01, and f++ is the total number of schemas.
H(Ap, Aq) = (f01 · f10) / (f+1 · f1+)
Negative correlation: mn = H(Ap, Aq).
Positive correlation: mp = 1 − H(Ap, Aq) if f11/f++ ≥ Td, else mp = 0; the threshold Td filters out attribute pairs that co-occur too rarely to trust.
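These measures can be computed directly from schemas represented as attribute sets. A minimal sketch; Td = 0.1 is an illustrative threshold, not the paper's tuned value:

```python
def contingency(schemas, ap, aq):
    """2x2 co-occurrence counts of attributes ap, aq over the schemas."""
    f11 = sum(1 for s in schemas if ap in s and aq in s)
    f10 = sum(1 for s in schemas if ap in s and aq not in s)
    f01 = sum(1 for s in schemas if aq in s and ap not in s)
    return f11, f10, f01

def h_measure(schemas, ap, aq):
    """H = f01*f10 / (f+1 * f1+): near 1 when the attributes rarely
    co-occur, near 0 when they usually appear together."""
    f11, f10, f01 = contingency(schemas, ap, aq)
    denom = (f11 + f01) * (f11 + f10)   # f+1 * f1+
    return (f01 * f10 / denom) if denom else 0.0

def m_negative(schemas, ap, aq):
    return h_measure(schemas, ap, aq)

def m_positive(schemas, ap, aq, td=0.1):
    f11, _, _ = contingency(schemas, ap, aq)
    if f11 / len(schemas) < td:         # f11/f++: too rare to trust
        return 0.0
    return 1.0 - h_measure(schemas, ap, aq)

schemas = [{"adults", "children"}, {"adults", "children"},
           {"passengers"}, {"passengers"}]
m_positive(schemas, "adults", "children")    # 1.0
m_negative(schemas, "adults", "passengers")  # 1.0
```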
Matching Discovery, Step I: Group Discovery
Positively correlated attributes form candidate groups.
A group is useful for matching only if it is negatively correlated with other groups; e.g. if every source contains {last name, first name}, that group needs no matching.
Matching Discovery, Step II: Complex Matching Discovery
Matching discovery operates on the attribute groups found in Step I; synonym groups should be negatively correlated with one another.
Matching: Ranking
Each candidate matching is scored so that matchings can be compared.
The score of a matching Mj is the maximal negative correlation between any pair of its attribute groups:
Cmax(Mj, mn) = max mn(Gjr, Gjt) over all groups Gjr, Gjt in Mj with r ≠ t.
This score, combined with semantic subsumption, yields the ranking.
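The Cmax score is a straightforward maximum over group pairs. A sketch that takes the pairwise mn scores as a precomputed table:

```python
from itertools import combinations

def cmax(matching, mn):
    """Rank of a matching = the maximal negative correlation over all
    pairs of its attribute groups; `mn` maps an unordered group pair
    to a precomputed m_n score."""
    return max(mn[frozenset((g, h))] for g, h in combinations(matching, 2))

g1 = frozenset({"adults", "seniors", "children"})
g2 = frozenset({"number of tickets"})
cmax([g1, g2], {frozenset((g1, g2)): 0.92})  # 0.92
```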
Matching Selection, Step III: Complex Matching Selection
Complex matching discovery can produce false matchings, e.g. both {author} = {first name, last name} and {subject} = {first name, last name}.
These matchings must conflict, since "author" and "subject" are not synonyms; the solution is to remove conflicts based on negative correlation.
Final Algorithm
In each iteration, find the highest-ranked matching Mt and add it to the result set.
Remove the remaining candidate matchings that are inconsistent with Mt.
Repeat until no candidates remain, then return the final set of matchings.
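The iteration above is a greedy selection. A minimal sketch, modelling a matching as a set of attribute groups and treating two matchings as conflicting when they share a group:

```python
def select_matchings(candidates, score, conflicts):
    """Greedy selection: repeatedly keep the highest-ranked matching
    and drop remaining candidates that are inconsistent with it."""
    chosen, remaining = [], list(candidates)
    while remaining:
        best = max(remaining, key=score)
        chosen.append(best)
        remaining = [m for m in remaining
                     if m != best and not conflicts(m, best)]
    return chosen

m1 = frozenset({frozenset({"author"}),
                frozenset({"first name", "last name"})})
m2 = frozenset({frozenset({"subject"}),
                frozenset({"first name", "last name"})})
scores = {m1: 0.9, m2: 0.4}
result = select_matchings([m1, m2], scores.get,
                          lambda a, b: bool(a & b))
# result == [m1]; m2 is dropped because it shares a group with m1
```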
Putting It Together
As shown above, MetaQuerier consists of several subsystems that were developed concurrently, which raises some questions:
What level of accuracy is good enough for each subsystem, and can a subsystem be made more accurate?
Since each subsystem depends on the previous ones, is one subsystem's accuracy sufficient for the next subsystem to do its job?
Putting It Together: Observations
Connecting the subsystems may demand higher accuracy from each individual subsystem, and information from later subsystems can be fed back to earlier ones to increase accuracy.
Solutions: to sustain the accuracy of schema matching (SM), use an ensemble; to improve the accuracy of interface extraction (IE), use feedback.
Ensemble
Procedure: execute the matcher on a smaller random sample of the input schemas; each sampled set is called a trial.
Run multiple matchers, each over an independent trial, and aggregate their results.
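The trial-and-aggregate loop can be sketched as follows; majority voting stands in here for the paper's aggregation step, and the trial count and sample size are illustrative:

```python
import random

def ensemble_match(schemas, matcher, trials=11, sample_size=None, seed=0):
    """Run the base matcher on several independent random samples
    ('trials') of the input schemas, keeping only matchings found in a
    majority of trials; sampling damps the effect of noisy schemas."""
    rng = random.Random(seed)
    size = sample_size or max(1, len(schemas) // 2)
    votes = {}
    for _ in range(trials):
        trial = rng.sample(schemas, size)
        for m in matcher(trial):
            votes[m] = votes.get(m, 0) + 1
    return [m for m, v in votes.items() if v > trials // 2]
```

A matching found only in a few trials (e.g. one caused by a rare, noisy schema) fails the majority vote and is discarded.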
Feedback: Domain Statistics
Observation: IE can take feedback from SM to improve its accuracy.
The information passed back, obtained by voting among similar query interfaces, is called a domain statistic. Three types are used: attribute types, attribute frequencies, and attribute correlations.
Feedback: Attribute Types
Commonly occurring attributes have a known type that can guide parsing, e.g. ISBN is a numeric attribute.
Feedback: Attribute Frequency
Certain attributes are very common within a domain, e.g. "Last Name".
"Last Name" is a far more common query-interface attribute than sample text such as "e.g. Mike", which tells the extractor that "Last Name" is probably the right label for the text box.
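Frequency feedback amounts to preferring the candidate label with the higher domain frequency. A one-function sketch; the frequency table and its counts are illustrative:

```python
def pick_label(candidates, domain_freq):
    """Among candidate labels extracted for one form field, prefer the
    one that is a frequent attribute in the domain statistics."""
    return max(candidates, key=lambda c: domain_freq.get(c.lower(), 0))

pick_label(["Last Name", "e.g. Mike"],
           {"last name": 87, "first name": 85})
# "Last Name"
```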
Feedback: Correlation
Suppose "adults" and "children" are positively correlated, and both are negatively correlated with "passengers".
If "children" has been identified as an attribute with certainty, the extractor will prefer "adults" as the other attribute and ignore "passengers".
Feedback: Combining Domain Statistics
How are the three rules (type, frequency, correlation) combined? The authors apply them in a fixed order of precedence.
Review
What is MetaQuerier? System architecture: database crawler, interface extraction
Schema matching: data preparation, correlation measure, matching discovery and selection
Putting it together: issues and solutions
Questions?
Sources: attributed multi-set grammar; the deep Web; papers by the authors