Published by Tracy Owens. Modified over 8 years ago.
1
Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
Bin He
Joint work with: Kevin Chen-Chuan Chang, Jiawei Han
Univ. of Illinois at Urbana-Champaign
2
Context: MetaQuerier
Large-scale integration of the deep Web
[Figure labels: MetaQuerier; Query; Result; The Deep Web]
3
Challenge: Matching query interfaces (QIs)
[Figure labels: Book Domain; Music Domain; m:n complex matching; 1:1 simple matching]
4
Demo.
5
Traditional approaches of schema matching: pairwise attribute correspondence
  Pairwise matching of two schemas
    S1: author, title, subject, ISBN
    S2: name, title, category, format
  Pairwise attribute correspondence
    S1.author = S2.name
    S1.subject = S2.category
  But scale is a challenge: how to address the challenge of large scale?
  And scale is an opportunity: how to leverage the opportunity of large scale?
6
A holistic schema matching paradigm
  Input: a set of schemas
    S1: author, title, subject, ISBN
    S2: writer, title, category, format
    S3: name, title, keyword, binding
  Holistic Schema Matching
  Output: a semantic model for all attribute matchings
    author = name = writer
    subject = category
    format = binding
7
Holistic matching is, in essence, data mining to discover semantics for information integration
[Figure labels: Semantics (semantic correspondences); Observations (attribute occurrences); Hidden Regularities; Generation; Statistical Analysis for Model Discovery; Our Hypothesis; Our Approach]
8
Regularity: Co-occurrence patterns
  Grouping attributes: {Last Name, First Name}
  Synonym attributes: Author = {Last Name, First Name}
[Figure: query interfaces of (a) amazon.com, (b) www.randomhouse.com, (c) bn.com, (d) 1bookstreet.com]
9
Schema matching as correlation mining
Across many sources:
  Synonym attributes are negatively correlated
    synonym attributes are semantic alternatives
    thus, they rarely co-occur in query interfaces
  Grouping attributes are positively correlated
    grouping attributes are semantic complements
    thus, they often co-occur in query interfaces
10
Data preparation: Prepare schema transactions to be mined
  Interface Extraction [SIGMOD'04]
  Type Recognition
    Type is not declared in Web interfaces
    Identify types from instance values, e.g., integer, datetime
    Used for constraining merging and matching
  Syntactic Merging
    Merge attributes with syntactically similar names, e.g., "title of book" to "title", "author's name" to "author"
    Merge attributes with syntactically similar instance values
[Figure labels: attribute; operator; value]
11
DCM: Dual Correlation Mining framework
  1. Positive correlation mining as potential groups
       Mining positive correlations:
         Last Name (any), First Name (any)
  2. Negative correlation mining as potential matchings
       Mining negative correlations:
         Author (any) = {Last Name (any), First Name (any)}
         ISBN (any) = {Last Name (any), First Name (any)}
  3. Matching selection as model construction
         Author (any) = {Last Name (any), First Name (any)}
         Subject (string) = Category (string)
         Format (string) = Binding (string)
12
Correlation measure for qualification
  Goal: find groups and matchings that pass the correlation threshold
  Observation: pairwise correlations
    e.g., in the Airfares domain, to = arrival city = destination
      to and arrival city are negatively correlated
      to and destination are negatively correlated
      arrival city and destination are negatively correlated
  Measure: $C_{\min} = \min m(A_i, A_j)$, for all $i \neq j$
    m: some correlation measure for two items
    supports downward closure, which enables the Apriori algorithm
    accommodates different measures m
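The qualification measure can be illustrated with a minimal Python sketch. The pairwise scores below are hypothetical values chosen only for illustration, and `toy_m` stands in for whichever two-item measure m is used; neither comes from the slides.

```python
from itertools import combinations

# Hypothetical pairwise scores (assumed for illustration, not from the talk).
SCORES = {
    frozenset(("to", "arrival city")): 0.9,
    frozenset(("to", "destination")): 0.8,
    frozenset(("arrival city", "destination")): 0.7,
}

def toy_m(a, b):
    """Stand-in for the two-item correlation measure m."""
    return SCORES.get(frozenset((a, b)), 0.0)

def c_min(items, m):
    """C_min: the minimum of m over all pairs of distinct items."""
    return min(m(a, b) for a, b in combinations(items, 2))

triple = ["to", "arrival city", "destination"]
pair = ["to", "arrival city"]

# Downward closure: a superset can never score higher than its subset,
# so a set passes the threshold only if every subset also passes.
assert c_min(triple, toy_m) <= c_min(pair, toy_m)
```

Because the minimum is taken over pairs, adding an item can only keep the score the same or lower it, which is the property Apriori-style pruning relies on.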
13
The mining process: a standard Apriori algorithm
  Schema transactions
    Departure City, Destination, ….
    From, To, …
    Departure City, Arrival City, …
  Correlated items with length 2
    Destination = To
    Destination = Arrival City
    To = Arrival City
    Departure City = From
    …
  Correlated items with length 3
    Destination = To = Arrival City
    …
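The level-wise process on this slide can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the scores are hypothetical, `toy_m` stands in for the measure m, and the threshold of 0.5 is arbitrary.

```python
from itertools import combinations

# Hypothetical pairwise scores (assumed for illustration only).
SCORES = {
    frozenset(("Destination", "To")): 0.9,
    frozenset(("Destination", "Arrival City")): 0.8,
    frozenset(("To", "Arrival City")): 0.85,
    frozenset(("Departure City", "From")): 0.9,
}
ITEMS = ["Destination", "To", "Arrival City", "Departure City", "From"]

def toy_m(a, b):
    """Stand-in for the two-item correlation measure m."""
    return SCORES.get(frozenset((a, b)), 0.0)

def c_min(items, m):
    """Minimum of m over all pairs of distinct items."""
    return min(m(a, b) for a, b in combinations(items, 2))

def mine_correlated(items, m, threshold):
    """Apriori-style search: grow qualified sets one item at a time."""
    # Level 2: every pair that passes the threshold.
    level = {frozenset(p) for p in combinations(sorted(items), 2)
             if m(*p) >= threshold}
    found = []
    while level:
        found.extend(level)
        k = len(next(iter(level)))
        # Join step: unions of two qualified k-sets that have k + 1 items.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # Prune step: keep a candidate only if all its k-subsets qualified
        # and it passes the threshold itself.
        level = {c for c in candidates
                 if all(frozenset(s) in level for s in combinations(c, k))
                 and c_min(c, m) >= threshold}
    return found

result = mine_correlated(ITEMS, toy_m, threshold=0.5)
```

With these toy scores the search yields the four pairs listed on the slide and the single length-3 set Destination = To = Arrival City.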
14
Correlation measures for ranking
  Goal: rank and select matchings in model construction
  The qualification measure is not good for ranking
    a set cannot win over its subset, due to the downward closure
      e.g., min({1, 2, 3}) < min({2, 3})
    but a superset contains more matchings and should be preferred
  Ranking measure: $C_{\max} = \max m(A_i, A_j)$, for all $i \neq j$
    a set does not win over its superset
    when there is a tie, break it by semantic richness
      $A_1 = A_2 = A_3$ is semantically richer than $A_1 = A_2$
      $A_1 = \{A_2, A_3\}$ is semantically richer than $A_1 = A_2$
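A minimal Python sketch of the ranking step follows. Several choices here are assumptions made for illustration: a matching is represented as a tuple of attribute groups, "semantic richness" is approximated by the total number of attributes a matching covers, and the scores are hypothetical.

```python
from itertools import combinations

# A matching is a tuple of groups; each group is a tuple of attribute names.
AUTHOR_SPLIT = (("author",), ("last name", "first name"))
AUTHOR_NAME = (("author",), ("name",))
SUBJECT_CAT = (("subject",), ("category",))

# Hypothetical scores between groups (assumed for illustration only).
SCORES = {
    frozenset(AUTHOR_SPLIT): 0.9,
    frozenset(AUTHOR_NAME): 0.9,
    frozenset(SUBJECT_CAT): 0.7,
}

def toy_m(g, h):
    """Stand-in for the two-item correlation measure m."""
    return SCORES.get(frozenset((g, h)), 0.0)

def c_max(matching, m):
    """C_max: the maximum of m over all pairs of distinct groups."""
    return max(m(g, h) for g, h in combinations(matching, 2))

def richness(matching):
    """Assumed proxy for semantic richness: number of attributes covered."""
    return sum(len(group) for group in matching)

def rank(matchings, m):
    """Sort by C_max, breaking ties in favor of the richer matching."""
    return sorted(matchings,
                  key=lambda x: (c_max(x, m), richness(x)),
                  reverse=True)

ranked = rank([SUBJECT_CAT, AUTHOR_NAME, AUTHOR_SPLIT], toy_m)
```

Here the two author matchings tie on $C_{\max}$, so the one covering three attributes is ranked first.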
15
Choosing the m: Measuring the correlation of two items
  Contingency table
  We explore 22 measures, e.g.,
    $\text{Lift} = \frac{f_{00} f_{11}}{f_{01} f_{10}}$
    $\text{Jaccard} = \frac{f_{11}}{f_{11} + f_{01} + f_{10}}$
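The contingency counts and the two example measures can be sketched in Python as below. The formulas follow the slide as written; the schema transactions are made-up toy data, and the handling of a zero denominator is an assumption of this sketch.

```python
import math

# Toy schema transactions (assumed for illustration only).
schemas = [
    {"author", "title", "subject"},
    {"name", "title", "category"},
    {"author", "title", "category"},
    {"last name", "first name", "title"},
]

def contingency(transactions, a, b):
    """Return (f11, f10, f01, f00): the first index is presence of a,
    the second is presence of b."""
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        has_a, has_b = a in t, b in t
        if has_a and has_b:
            f11 += 1
        elif has_a:
            f10 += 1
        elif has_b:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

def lift(f11, f10, f01, f00):
    """Lift as written on the slide: f00 * f11 / (f01 * f10)."""
    num, den = f00 * f11, f01 * f10
    if den == 0:
        # Assumption of this sketch: undefined cases map to inf or nan.
        return math.inf if num > 0 else math.nan
    return num / den

def jaccard(f11, f10, f01, f00):
    """Jaccard: f11 / (f11 + f01 + f10); co-absence f00 is ignored."""
    den = f11 + f01 + f10
    return f11 / den if den else 0.0

author_title = contingency(schemas, "author", "title")
author_name = contingency(schemas, "author", "name")
```

In the toy data, author and name never appear together, so their Jaccard score is 0, while author and title co-occur in half of the interfaces that contain either.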
16
Choosing the m: The problems of existing measures
  Co-presence ($f_{11}$) is more important than co-absence ($f_{00}$)
    Less positive correlation, but a higher Lift = 17
    More positive correlation, but a lower Lift = 0.69
  Rare attributes are not statistically convincing
    $A_p$ as a rare attribute: Jaccard = 0.02
    No rare attributes: Jaccard = 0.02
17
Choosing the m: H-measure
  H-measure: $H = \frac{f_{01} f_{10}}{f_{+1} f_{1+}}$
  Ignores the co-absence
    Less positive correlation: H = 0.25
    More positive correlation: H = 0.07
  Differentiates the subtlety of negative correlations
    $A_p$ as a rare attribute: H = 0.49
    No rare attributes: H = 0.92
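The H-measure formula can be written as a small Python function. This sketch assumes the usual contingency-table marginals, $f_{1+} = f_{11} + f_{10}$ and $f_{+1} = f_{11} + f_{01}$, which the slide does not spell out; returning 0 when an attribute never occurs is also an assumption.

```python
def h_measure(f11, f10, f01):
    """H = f01 * f10 / (f+1 * f1+); the co-absence count f00 is not used."""
    f1_plus = f11 + f10   # interfaces containing the first attribute
    f_plus1 = f11 + f01   # interfaces containing the second attribute
    if f1_plus == 0 or f_plus1 == 0:
        # Assumption of this sketch: no evidence when an attribute is absent.
        return 0.0
    return (f01 * f10) / (f_plus1 * f1_plus)

# Attributes that never co-occur get the maximum score of 1.
never_together = h_measure(f11=0, f10=3, f01=4)
# Attributes that always co-occur get the minimum score of 0.
always_together = h_measure(f11=5, f10=0, f01=0)
```

A higher H therefore indicates a stronger negative correlation, which is the signal used for synonym attributes.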
18
Experimental setup
  447 deep Web sources in 8 domains
  Domains
    Travel: Airfares, Hotels, Car Rentals
    Entertainment: Books, Movies, Music Records
    Living: Jobs, Automobiles
  Available as the TEL-8 dataset in the UIUC Web Integration Repository
    http://metaquerier.cs.uiuc.edu/repository/
19
Results in Books and Airfares domains
  Books
    author (any) = {last name (any), first name (any)}
    subject (string) = category (string)
    format (string) = binding (string)
  Airfares
    passenger (integer) = {adult (integer), child (integer), infant (integer)}
    from (string) = departure city (string) = depart (string)
    departure date (datetime) = depart (datetime)
    return date (datetime) = return (datetime)
    class (string) = cabin (string)
    destination (string) = to (string) = {departure city (string), arrival city (string)}
20
Contributions
  Insight
    We build a conceptually novel connection between data integration and correlation mining
      schema matching as a new application of correlation mining
      correlation mining as a new approach for schema matching
  Techniques
    The dual correlation mining framework
    Measures for qualification and ranking
    H-measure, robust for negative correlations
21
Thank You!