Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web
14/05/2019 Sagar Khushalani
What is ‘Deep Web’?
- Deep Web refers to World Wide Web content that is NOT part of the surface Web, i.e., content that is not indexed by standard search engines.
What is ‘Deep Web’?
- Terabytes of content hidden behind database interfaces
- Invisible to crawlers, and therefore invisible to users
- Difficulties:
  - Find the right database(s)
  - Query the database(s)
Overview
- MetaQuerier
  - Goals, Features and Challenges
  - System Architecture
    - Database Crawler
    - Interface Extraction
    - Schema Matching (SM)
- Introduction to DCM (Dual Correlation Mining)
  - Data Preparation
  - Complex Matching
  - Correlation Measure
- Putting it together
  - Ensemble
  - Feedback
- Conclusion
Goals & Challenges
- Goals:
  - Make the “deep web” accessible (find)
  - Make it usable (query)
- Challenges:
  - What are the query capabilities of a source?
  - How to mediate queries?
- Factors that help:
  - Regularity across sources in the same domain
  - New sources influenced by existing sources
MetaQuerier: System Architecture
[Figure: MetaQuerier system architecture]
Database Crawler
- Functionality: Automatically find databases by identifying query interfaces
- Insight: Query interfaces are usually found close to the root page of a web site
- Approach:
  - Site Collector
  - Shallow Crawler
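The shallow-crawling idea can be sketched as a breadth-first walk that stops a few links away from the root. This is a minimal illustration, not the authors' crawler: the in-memory `SITE` dictionary, the `has_form` flag and the `max_depth` parameter are all hypothetical stand-ins for real page fetching and form detection.

```python
from collections import deque

def shallow_crawl(site, root, max_depth=3):
    """Breadth-first crawl of one site, stopping max_depth links from the root.

    `site` maps a page URL to {'links': [...], 'has_form': bool} (hypothetical
    stand-in for fetching a page). Returns the pages that hold a query form.
    """
    found = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        page = site.get(url)
        if page is None:
            continue  # dangling link
        if page["has_form"]:
            found.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found

# Toy site: two forms near the root, one buried three links deep.
SITE = {
    "/": {"links": ["/search", "/about"], "has_form": False},
    "/search": {"links": ["/advanced"], "has_form": True},
    "/about": {"links": ["/team"], "has_form": False},
    "/advanced": {"links": [], "has_form": True},
    "/team": {"links": ["/team/deep"], "has_form": False},
    "/team/deep": {"links": [], "has_form": True},
}
```

With `max_depth=2` the buried form is skipped, which is the trade-off the insight accepts: most interfaces sit near the root, so a small depth limit keeps the crawl cheap.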
Interface Extraction
- Functionality: Extract the query templates of a given interface, each as a 3-tuple
- Insight: HTML query forms follow a hidden syntax, so extraction can be treated as a parsing problem and solved with parse trees
- Approach: The authors use ‘visual language’ parsing
Schema Matching: Functionality
- Attributes from different interfaces need to be analysed
- Equivalent attributes need to be combined (query mediation)
- Finding equivalent attributes or attribute groups requires schema matching
Schema Matching: Procedure
[Figure: schema matching procedure]
Schema Matching: Approach
- Simple matching vs. complex matching
- Current schema matching methods compare only two schemas at a time and cannot match groups of attributes
  - E.g.: {passengers} = {adults, seniors, children, infants}
- Complex matching allows context information to be used
- Grouping attributes & synonym attributes
Motivation: Example
- User Amy wants to buy 2 tickets to fly from Anchorage to Baltimore: one for herself, one for her child
- Website 1 attributes: From, To, Number of passengers
- Website 2 attributes: Origin, Destination, Adults, Children, Infants, Seniors
Example: Complex Matching
- E.g.:
  - {adults, seniors, children} = {number of tickets}
  - {from} = {leaving from}
- Grouping: {adults, children, seniors}
- Synonym: {number of tickets} = {adults, seniors, children}
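To make the role of such matchings in query mediation concrete, here is a small sketch that rewrites a query posed against the fine-grained interface into the vocabulary of the coarse one. The `SYNONYMS` and `GROUPS` tables and the `mediate` function are hypothetical, and summing the group members into one ticket count is an assumption made for illustration; the slides do not say how group values are combined.

```python
# Hypothetical matchings, written from the fine-grained interface's side.
SYNONYMS = {"leaving from": "from"}
GROUPS = {"number of tickets": {"adults", "seniors", "children"}}

def mediate(query):
    """Translate a fine-grained query into the coarse interface's attributes."""
    translated = {}
    for attr, value in query.items():
        # Which grouped attribute, if any, does this attribute belong to?
        group = next((g for g, members in GROUPS.items() if attr in members), None)
        if group is not None:
            # Assumption: members of a group add up to the grouped attribute.
            translated[group] = translated.get(group, 0) + value
        else:
            translated[SYNONYMS.get(attr, attr)] = value
    return translated

# Amy's query: one adult and one child.
amy = {"leaving from": "Anchorage", "adults": 1, "children": 1}
```

Calling `mediate(amy)` yields a query with a single ticket count of 2 under the attribute name the coarse interface expects.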
Data Preparation
- HTML forms are not “minable” as they stand
- Pre-processing: the data needs to be “prepared”
  - Form extraction
  - Attribute type recognition
  - Syntactic merging
Data Preparation: Form Extraction
- Reads a web page containing query forms and extracts the attribute names
  - E.g. Title of Book: <name = “title of book”, domain = any>
- Method: Parsing
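As a rough illustration of turning a form into (name, domain) pairs, the sketch below uses Python's standard `html.parser`. It is far simpler than the parsing approach the slides refer to: it assumes every field is preceded by a `<label>`, treats text inputs as having domain "any", and takes a select box's option texts as its domain. The class and function names are made up for this example.

```python
from html.parser import HTMLParser

class FormAttributeExtractor(HTMLParser):
    """Collects (name, domain) pairs from labelled form fields."""

    def __init__(self):
        super().__init__()
        self.attributes = []   # extracted (name, domain) pairs
        self._label = ""       # text of the most recent <label>
        self._capture = None   # "label" or "option" while inside that tag
        self._options = None   # option texts of the <select> being read

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self._label, self._capture = "", "label"
        elif tag == "input" and dict(attrs).get("type", "text") == "text":
            self.attributes.append((self._label.lower(), "any"))
        elif tag == "select":
            self._options = []
        elif tag == "option" and self._options is not None:
            self._options.append("")
            self._capture = "option"

    def handle_data(self, data):
        if self._capture == "label":
            self._label += data.strip().rstrip(":")
        elif self._capture == "option":
            self._options[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("label", "option"):
            self._capture = None
        elif tag == "select" and self._options is not None:
            self.attributes.append((self._label.lower(), self._options))
            self._options = None

def extract_attributes(html):
    parser = FormAttributeExtractor()
    parser.feed(html)
    return parser.attributes

FORM = """
<form>
  <label>Title of Book:</label> <input type="text" name="t">
  <label>Format:</label>
  <select name="f"><option>Hardcover</option><option>Paperback</option></select>
  <input type="submit" value="Search">
</form>
"""
```

Real query forms rarely label their fields this cleanly, which is why the extraction step is treated as a parsing problem rather than a tag lookup.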
Data Preparation: Type Recognition
- Confusion: “departing” can mean:
  - City of departure
  - Time of departure
- Entities are distinguished by both attribute name and type, which avoids confusion caused by homonyms
- Type Recognizer
Data Preparation: Syntactic Merging
- Name-based merging: merge two attributes if their names are similar
  - Only merge when A is a variation of B (e.g. “title” and “title of book”) and B is more frequently used than A
- Domain-based merging: merge two attributes if their domain values are similar
  - Only consider string-type attributes
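Name-based merging can be sketched as follows. The slides do not define "variation", so this example assumes that one name is a variation of another when its words are a subset of the other's words; `is_variation` and `name_based_merge` are hypothetical names, and the sketch does not follow chains of merges.

```python
from collections import Counter

def is_variation(a, b):
    """Assumed test: the words of one name are a subset of the other's."""
    words_a, words_b = set(a.split()), set(b.split())
    return a != b and (words_a <= words_b or words_b <= words_a)

def name_based_merge(names):
    """Map each attribute name to the more frequent name it merges into.

    `names` lists every occurrence of an attribute name across the interfaces.
    A name with no more frequent variation maps to itself.
    """
    freq = Counter(names)
    canonical = {}
    for a in freq:
        # Merge A into B only if B is a variation of A and B is used more often.
        better = [b for b in freq if is_variation(a, b) and freq[b] > freq[a]]
        canonical[a] = max(better, key=lambda b: freq[b]) if better else a
    return canonical

observed = ["title"] * 5 + ["title of book"] * 2 + ["author"] * 4
```

Here "title of book" is folded into the more frequent "title", while "title" and "author" are left alone.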
Correlation Measure
- What is correlation? A test or score based on the contingency table
- Contingency table
- Two types of correlation:
  - Positive (mp)
  - Negative (mn)
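Correlation Measure
- H(Ap, Aq) = (f01 * f10) / (f+1 * f1+)
- Negative correlation mn: mn = H(Ap, Aq)
- Positive correlation mp:
  - If f11/f++ >= Td: mp = 1 - H(Ap, Aq)
  - Else: mp = 0
  - Td is a threshold parameter that filters out attribute pairs which rarely appear together
- [Figure: contingency table for Ap, Aq; each fij counts the schemas by presence (1) or absence (0) of Ap and Aq, and + sums over that index]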
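The measure can be computed directly from a list of schemas, each represented as a set of attribute names. The sketch below assumes the threshold is meant to discard pairs that co-occur in too small a fraction of schemas, so positive correlation is only reported when f11/f++ reaches Td. The default `td=0.1` and the handling of attributes that never appear (H set to 0) are choices made for this example only.

```python
def contingency(schemas, ap, aq):
    """Count schemas by presence (1) or absence (0) of ap and aq."""
    f = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for schema in schemas:
        f[(int(ap in schema), int(aq in schema))] += 1
    return f

def h_measure(f):
    """H = f01 * f10 / (f+1 * f1+)."""
    f1_plus = f[(1, 1)] + f[(1, 0)]   # schemas containing ap
    f_plus1 = f[(1, 1)] + f[(0, 1)]   # schemas containing aq
    if f1_plus == 0 or f_plus1 == 0:
        return 0.0  # H is undefined here; this sketch returns 0
    return (f[(0, 1)] * f[(1, 0)]) / (f_plus1 * f1_plus)

def negative_correlation(f):
    return h_measure(f)

def positive_correlation(f, td=0.1):
    total = sum(f.values())  # f++
    if total == 0 or f[(1, 1)] / total < td:
        return 0.0  # the pair co-occurs too rarely to count as a group
    return 1.0 - h_measure(f)

schemas = [
    {"adults", "children"},
    {"adults", "children"},
    {"adults", "children", "from"},
    {"passengers"},
    {"passengers"},
]
together = contingency(schemas, "adults", "children")
apart = contingency(schemas, "adults", "passengers")
```

In this toy data "adults" and "children" always appear together, so they score as a perfect grouping, while "adults" and "passengers" never share a schema and score as perfect synonym candidates.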
Matching: Discovery
- Step I: Group Discovery
  - Positively correlated attributes can form potential groups
  - However, a group that has no negative correlation with other groups is of no use for matching
  - E.g.: {lastname, firstname}: if all sources have this group, there is no need for matching
Matching: Discovery
- Step II: Complex Matching Discovery
  - Matching discovery works on attribute groups
  - Negative correlation should exist between synonym groups
Matching: Ranking
- Each matching needs to be ranked so that matchings can be compared
- The score of a matching is the maximal negative correlation over all pairs of attribute groups in the matching
  - Cmax(Mj, mn) = max mn(Gjr, Gjt) over all Gjr, Gjt in Mj with r != t
- The score, combined with semantic subsumption, allows matchings to be ranked
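The Cmax score is a maximum over all pairs of groups within one matching, which the following sketch spells out. The negative-correlation function `mn` is passed in as a parameter; the `SCORES` table and its values are invented for the example, and the tie-break by semantic subsumption is not modelled.

```python
from itertools import combinations

def c_max(matching, mn):
    """Maximal negative correlation over all pairs of groups in a matching."""
    return max(mn(g, h) for g, h in combinations(matching, 2))

def rank_matchings(matchings, mn):
    """Order matchings from highest to lowest Cmax score."""
    return sorted(matchings, key=lambda m: c_max(m, mn), reverse=True)

# Hypothetical groups and pairwise negative-correlation values.
PASSENGERS = frozenset({"passengers"})
TICKETS = frozenset({"number of tickets"})
PARTY = frozenset({"adults", "children"})
FROM = frozenset({"from"})
LEAVING = frozenset({"leaving from"})

SCORES = {
    frozenset((PASSENGERS, PARTY)): 0.9,
    frozenset((PASSENGERS, TICKETS)): 0.7,
    frozenset((PARTY, TICKETS)): 0.8,
    frozenset((FROM, LEAVING)): 0.95,
}

def lookup(g, h):
    return SCORES[frozenset((g, h))]

m_people = (PASSENGERS, PARTY, TICKETS)
m_origin = (FROM, LEAVING)
```

With these made-up values the three-group matching scores 0.9 (its strongest pair) and ranks below the origin matching at 0.95.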
Matching: Selection
- Step III: Complex Matching Selection
  - Complex matching can create false matchings
  - E.g.: {author} = {first name, last name} & {subject} = {first name, last name}
  - Why? These two matchings conflict with each other
  - Solution: remove conflicts based on negative correlation
Final Algorithm
- In each iteration, find the highest-ranked matching Mt and add it to the result set
- Remove the remaining matchings that are inconsistent with Mt
- Return the final set of matchings
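The selection loop is greedy, and a short sketch makes the control flow explicit. The slides do not define inconsistency precisely, so this example assumes two matchings conflict whenever they share an attribute; `conflicts`, `select_matchings` and the scores are hypothetical.

```python
def attributes_of(matching):
    """All attribute names used by any group of the matching."""
    return set().union(*matching)

def conflicts(m1, m2):
    """Assumed conflict test: the two matchings share at least one attribute."""
    return bool(attributes_of(m1) & attributes_of(m2))

def select_matchings(scored):
    """Greedy selection over (matching, score) pairs."""
    remaining = sorted(scored, key=lambda pair: pair[1], reverse=True)
    selected = []
    while remaining:
        best, _ = remaining.pop(0)   # highest-ranked matching Mt
        selected.append(best)
        # Drop every remaining matching that is inconsistent with Mt.
        remaining = [(m, s) for m, s in remaining if not conflicts(m, best)]
    return selected

NAME = frozenset({"first name", "last name"})
m_author = (frozenset({"author"}), NAME)
m_subject = (frozenset({"subject"}), NAME)
m_topic = (frozenset({"subject"}), frozenset({"category"}))

candidates = [(m_subject, 0.6), (m_author, 0.9), (m_topic, 0.5)]
```

The author matching wins first, the false subject matching is dropped because it reuses {first name, last name}, and the remaining subject/category matching survives.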
Putting it together
- As shown before, MetaQuerier consists of various subsystems
- The subsystems were developed concurrently. However, this raises certain problems:
  - What level of accuracy is good enough?
  - Is it possible to make a subsystem more accurate?
- Each subsystem depends on the previous ones:
  - Is one subsystem’s accuracy enough for the next subsystem’s functions?
Putting it together
- Observation:
  - Joining the subsystems may require higher accuracy from each individual subsystem
  - Information from later subsystems can be fed back to earlier ones to increase accuracy
- Solution:
  - To sustain the accuracy of SM (Schema Matching): Ensemble
  - To improve the accuracy of IE (Interface Extraction): Feedback
Ensemble
- Procedure:
  - Execute the matcher on a smaller, random sample of the input schemas
  - Each such sample of schemas is called a trial
  - Run multiple matchers, each executed over an independent trial
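The sampling procedure can be sketched as below. How the per-trial results are combined is not stated on this slide; the sketch assumes simple majority voting, keeping a matching only if more than half of the trials produce it. The `matcher` callable, `sample_size`, `trials` and `seed` are all hypothetical parameters.

```python
import random
from collections import Counter

def ensemble_match(schemas, matcher, sample_size, trials, seed=0):
    """Run `matcher` on random samples of the schemas and keep majority results.

    `matcher` takes a list of schemas and returns an iterable of hashable
    matchings. Combining by majority vote is an assumption of this sketch.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(trials):
        trial = rng.sample(schemas, sample_size)  # one trial of schemas
        votes.update(set(matcher(trial)))
    return [m for m, count in votes.items() if count > trials / 2]

def toy_matcher(trial):
    """Stand-in matcher: reports a matching unless a noisy schema is sampled."""
    if any("noise" in schema for schema in trial):
        return ["author=noise"]
    return ["author=writer"]

clean = [{"author"}, {"writer"}, {"author", "title"}, {"writer", "title"}]
```

The intent is that a few noisy schemas only spoil the trials that happen to sample them, so the majority of trials still agree on the correct matchings.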
Feedback: Domain Statistics
- Observation: IE can take feedback from SM to improve its accuracy
- The information passed back is obtained by voting among similar query interfaces and is known as a domain statistic
- There are three types of domain statistics:
  1. Type of attributes
  2. Frequency of attributes
  3. Correlation of attributes
Feedback: Attribute Types
- Type of attributes: commonly occurring attributes have a type that can be used for parsing
  - E.g. ISBN is a numeric attribute
Feedback: Attribute Frequency
- Frequency of attributes: certain attributes are very common in a particular domain, e.g. “last name”
- “Last Name” is a much more common attribute in a query interface than “e.g. Mike”
- This tells the IE system that “Last Name” is probably the right attribute for the text box
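The frequency rule amounts to choosing, among the text fragments that could label a field, the one seen most often as an attribute in the domain. A minimal sketch, with a made-up `domain_frequency` table and a hypothetical `pick_label` helper:

```python
def pick_label(candidates, frequency):
    """Choose the candidate label that is most frequent in the domain.

    `frequency` maps lower-cased attribute names to how many interfaces in the
    domain use them; unseen candidates count as 0.
    """
    return max(candidates, key=lambda text: frequency.get(text.lower(), 0))

# Hypothetical counts gathered from other interfaces in the same domain.
domain_frequency = {"last name": 120, "first name": 115, "title": 90}
```

Given the two texts next to a box, `pick_label(["e.g. Mike", "Last Name"], domain_frequency)` prefers "Last Name" because the example hint never occurs as an attribute elsewhere.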
Feedback: Correlation
- Suppose “adults” and “children” have a positive correlation, and both have a negative correlation with “passengers”
- If “children” has definitely been identified as an attribute, then IE will choose “adults” as the other attribute and ignore “passengers”
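The correlation rule can be illustrated by scoring each candidate against the attributes already confirmed on the interface. The scoring scheme here (one point per positive correlation, minus one per negative correlation) is an assumption for this sketch, as are the function name and the two correlation sets.

```python
def choose_by_correlation(confirmed, candidates, positive, negative):
    """Pick the candidate that agrees best with already confirmed attributes.

    `positive` and `negative` are sets of unordered attribute pairs
    (frozensets) known to be positively or negatively correlated.
    """
    def score(candidate):
        pos = sum(frozenset((candidate, c)) in positive for c in confirmed)
        neg = sum(frozenset((candidate, c)) in negative for c in confirmed)
        return pos - neg
    return max(candidates, key=score)

positive_pairs = {frozenset(("adults", "children"))}
negative_pairs = {
    frozenset(("adults", "passengers")),
    frozenset(("children", "passengers")),
}
```

With "children" confirmed, "adults" scores +1 and "passengers" scores -1, so the extractor keeps "adults".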
Feedback: Domain Statistics
- How to combine these three rules (1: type, 2: frequency, 3: correlation)?
- The authors use the following strategy: 3 -> 2 -> 1
Review
- What is MetaQuerier?
- System Architecture
  - Database Crawler
  - Interface Extraction
  - Schema Matching
    - Data Preparation
    - Correlation Measure
    - Matching
- Putting it together: issues and solutions
Questions?
Sources:
- Attributed Multi-set Grammar: http://portal.acm.org/citation.cfm?id=864839
- Grammar: http://en.wikipedia.org/wiki/Grammar_(computer_science)#Formal_definition
- Deep Web: http://en.wikipedia.org/wiki/Deep_Web
- Papers by the authors:
  - http://eagle.cs.uiuc.edu/pubs/2004/parsing-sigmod04-zhc-mar04.pdf
  - http://eagle.cs.uiuc.edu/pubs/2003/unifiedschema-sigmod03-hc-mar03.pdf
  - http://eagle.cs.uiuc.edu/pubs/2004/dwsurvey-sigmodrecord-chlpz-aug04.pdf
  - http://eagle.cs.uiuc.edu/pubs/2004/complexmatching-sigkdd04-hch-jun04.pdf