Mapping Maintenance for Data Integration Systems
Robert McCann, University of Illinois
Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, & AnHai Doan
VLDB 2005
Data Integration Systems
[Figure: a user query ("Find homes under $300K") is posed over the mediated schema, which is mapped to source schemas 1, 2, and 3; the sources homeseekers.com, yahoo.com, and windermere.com are accessed through wrappers.]
Mapping Maintenance is a Key Bottleneck
Constructing mappings has proven difficult…
–(see first speaker)
…but maintenance often quickly dominates cost
E.g., Integrated Genome Database Project [Stein, 03]
–12 genomic databases, each remodeled data twice per year
–System broke every two weeks, abandoned after 1 year
E.g., Integration Project at Illinois
–Integrated 400 DB researcher homepages
–2 system administrators, stopped after 3 months
Reducing maintenance costs is now crucial!
Problem Definition
[Figure: the mediated schema (cost | city | numbeds | numbaths) is mapped to the output of the homeseekers.com wrapper (price, location, beds, baths), with tuples ($185,000, “Urbana, IL”, 2, 2) and ($270,000, “Seattle, WA”, 3, 2). 5 weeks later (source has changed), the wrapper output reads ($180, …) and ($260, …), and the mapping to the mediated schema is marked "?".]
Example 1: Change Source Schema or Data
[Figure: mediated schema (cost | city | numbeds | numbaths) over the homeseekers.com wrapper output (price, location, beds, baths), originally ($185,000, “Urbana, IL”, 2, 2) and ($270,000, “Seattle, WA”, 3, 2).]
–Update tuples: output becomes ($180,000, “Urbana, IL”, 2, 2) and ($260,000, “Seattle, WA”, 3, 2)
–Change units of price: output becomes (185, “Urbana, IL”, …) and (…, “Seattle, WA”, 3, 2)
Example 2: Change Presentation Format
[Figure: mediated schema (cost | city | numbeds | numbaths) over the homeseekers.com wrapper. Page snippets shown: “$185,000 Urbana, IL 2bed/2bath Century 21” and “$185,000 - Urbana, IL 2bed/2bath Century 21”. Original wrapper output (price, location, beds, baths): ($185,000, “Urbana, IL”, 2, 2) and ($270,000, “Seattle, WA”, 3, 2).]
–Display location as zipcode: page snippet “$185, … bed/2bath Century 21”; output becomes ($185, …) and ($270, …)
–Rearrange page layout: output becomes ($185,000, “Century 21”, 2, 2) and ($270,000, “RE/MAX”, 3, 2)
The MAVERIC Approach
Suppose administrator wants to maintain mappings for 1 year
1. For a short initial period (e.g., 5 weeks)
–Administrator manually verifies each mapping
–MAVERIC probes the source to learn data characteristics
2. For remaining time (e.g., 47 weeks)
–MAVERIC probes the source to observe new data instances
–MAVERIC outputs an alarm if characteristics differ
–If an alarm, administrator repairs mappings
Example
Training phase
–homeseekers.com on week 1, wrapper output (price, location, beds, baths): ($185,000, “Urbana, IL”, 2, 2) and ($270,000, “Seattle, WA”, 3, 2)
–homeseekers.com on week 5, wrapper output: ($132,000, “Salem, OR”, 2, 1) and ($365,000, “Atlanta, GA”, 4, 2)
Learned data characteristics
–If average price < 100,000, output alarm
–If layout of attributes changes, output alarm
–If beds < baths, output alarm
Verification phase
–homeseekers.com on week 6, wrapper output: (132, “Century 21”, …) and (…, “RE/MAX”, 2, 4)
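The learned characteristics in this example can be read as simple checks over a wrapper's output. The following is a minimal sketch, assuming each row is a dict with numeric `price`, `beds`, and `baths` fields; the function name `check_snapshot`, the row layout, and the week-6 row are hypothetical (the week-6 row is loosely modeled on the values shown on the slide), and the layout rule is left out because it needs the HTML page rather than the extracted rows.

```python
from statistics import mean

# Threshold taken from the rule "If average price < 100,000, output alarm".
MIN_AVG_PRICE = 100_000


def check_snapshot(rows):
    """Return the list of alarm reasons for one wrapper output.

    rows: list of dicts with keys price, location, beds, baths
    (this row layout is an assumption made for illustration).
    """
    alarms = []
    if mean(r["price"] for r in rows) < MIN_AVG_PRICE:
        alarms.append("average price below 100,000")
    if any(r["beds"] < r["baths"] for r in rows):
        alarms.append("beds < baths")
    return alarms


# Week-5 rows as shown on the slide.
week5 = [
    {"price": 132_000, "location": "Salem, OR", "beds": 2, "baths": 1},
    {"price": 365_000, "location": "Atlanta, GA", "beds": 4, "baths": 2},
]
# Hypothetical week-6 row: price units changed, layout shifted.
week6 = [{"price": 132, "location": "Century 21", "beds": 2, "baths": 4}]

print(check_snapshot(week5))
print(check_snapshot(week6))
```

Hard-coded rules like these are only a reading aid; in the talk the characteristics are learned from the training snapshots rather than written by hand.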
Contributions
Develop core MAVERIC system
–An ensemble of sensors that exploit multiple characteristics of data
–A combiner that leverages the most effective sensors
Significantly improve core system
–Generate synthetic data to improve training
–Leverage external data to improve training
–Employ filters to reduce false alarms
Extensive evaluation over 114 sources in 6 domains
–Core MAVERIC outperforms related work, improving F-1 by 4-19%
–Enhancements further improve F-1 by 2-13%
Training the Core MAVERIC System
Sensors learn internal profiles of data characteristics
Combiner learns weight for each sensor
–Employ Winnow to learn weights
[Figure: wrapper output for homeseekers.com on week 1 through week 5 feeds sensors s1 … sm, whose outputs feed the combiner. Example sensor profiles: avg value of price; layout of attributes in HTML pages (price location beds / baths). Week 1 output (price, location, beds, baths): ($185,000, “Urbana, IL”, 2, 2) and ($270,000, “Seattle, WA”, 3, 2). Week 5 output: ($132,000, “Salem, OR”, 2, 1) and ($365,000, “Atlanta, GA”, 4, 2).]
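To make the weight-learning step concrete, here is a minimal sketch of a Winnow-style multiplicative update over binary sensor votes. It assumes each sensor's output has been reduced to 0/1 (1 meaning "this sensor thinks the mapping is broken"); the function name `train_winnow`, the threshold choice of m/2, and the factor `alpha=2.0` are illustrative assumptions and not necessarily the variant MAVERIC uses.

```python
def train_winnow(examples, n_sensors, alpha=2.0):
    """Learn one weight per sensor with a Winnow-style update.

    examples: list of (votes, label) pairs, where votes is a list of
    0/1 sensor outputs and label is 1 if the mapping really was broken.
    """
    weights = [1.0] * n_sensors
    theta = n_sensors / 2  # a common threshold choice (assumption)
    for votes, label in examples:
        score = sum(w * v for w, v in zip(weights, votes))
        predicted = 1 if score >= theta else 0
        if predicted == label:
            continue  # Winnow only updates on mistakes
        # Missed alarm: promote the sensors that fired.
        # False alarm: demote the sensors that fired.
        factor = alpha if label == 1 else 1 / alpha
        weights = [w * factor if v else w for w, v in zip(weights, votes)]
    return weights, theta


examples = [
    ([1, 0, 0], 1),  # only sensor 0 fires, mapping is broken
    ([0, 1, 1], 0),  # sensors 1 and 2 fire, mapping is fine
    ([1, 1, 0], 1),  # sensors 0 and 1 fire, mapping is broken
]
weights, theta = train_winnow(examples, n_sensors=3)
print(weights, theta)
```

In this toy run the sensor that fires on real breaks ends up with the largest weight, which is the sense in which the combiner leverages the most effective sensors.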
Verifying with the Core MAVERIC System
Sensors leverage internal profiles to output sensor scores
Combiner combines scores based upon weights
–Alarm if combined score ≥ θ
[Figure: wrapper output for homeseekers.com on week 6 (price, location, beds, baths): (132, “Century 21”, …) and (…, “RE/MAX”, 2, 4). Sensors s1 … sm emit score 1 … score m (e.g., new avg price; layout of attributes has changed), which feed the combiner.]
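One way to read the combiner's decision rule is as a weighted vote. The slide states only that scores are combined based upon weights and compared with θ; the weighted sum below is an assumption, with $s_i$ the score of sensor $i$, $w_i$ its learned weight, and $m$ the number of sensors.

```latex
\text{alarm} \iff \sum_{i=1}^{m} w_i \, s_i \;\ge\; \theta
```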
Improving Training via Perturbation
Idea: expand training data by generating synthetic data
Simulate natural source changes during training
–Source data changes, e.g., insert and delete tuples
–Presentation format changes, e.g., $29.99 becomes USD
System “practices ahead of time”
[Figure: source S at t1 … tn is queried through the wrapper to give query results at t1 … tn (original results). The perturber applies a change, reapplies the wrapper, and tests the results, giving perturbed results. Original and perturbed results together form the training data for S, which feeds sensors s1 … sm and the combiner.]
Example: Reformatting Price
[Figure: the homeseekers.com wrapper turns the original HTML (“$185,000 Urbana, IL 3bed/2bath…”) into the original results (price, location, beds, baths): ($185,000, “Urbana, IL”, 3, 2), the original training example. The perturbation rewrites the page as the perturbed HTML (“185,000 USD Urbana, IL 3bed/2bath…”); the wrapper then yields the perturbed results (185,000 USD, “Urbana, IL”, 3, 2), which are tested (“?=”) and added to the training data for sensors s1 … sm and the combiner as a perturbed training example.]
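The perturb, reapply, and test loop from this example can be sketched in a few lines. This is a toy under stated assumptions: `RECORD_RE` stands in for a real wrapper, `reformat_price` is one hypothetical perturbation, and labeling the perturbed example as "correct" when the wrapper output equals the expected reformatted tuple is a guess at what the "?=" test on the slide checks.

```python
import re

# Toy wrapper rule (assumption): price, then "City, ST", then Nbed/Mbath.
RECORD_RE = re.compile(
    r"(?P<price>.+?)\s+(?P<location>[A-Z][a-z]+, [A-Z]{2})\s+"
    r"(?P<beds>\d+)bed/(?P<baths>\d+)bath"
)


def wrapper(page):
    """Extract one record as a dict, or None if the rule no longer matches."""
    m = RECORD_RE.match(page)
    return m.groupdict() if m else None


def reformat_price(value):
    """Perturbation: rewrite "$185,000" as "185,000 USD"."""
    return value.lstrip("$") + " USD"


def perturb_and_label(page):
    """Apply the change, reapply the wrapper, and test the results."""
    original = wrapper(page)
    new_price = reformat_price(original["price"])
    perturbed_page = page.replace(original["price"], new_price, 1)
    perturbed = wrapper(perturbed_page)
    # The perturber knows which change it made, so it knows what a
    # still-correct extraction should look like.
    expected = dict(original, price=new_price)
    label = "correct" if perturbed == expected else "broken"
    return perturbed, label


perturbed, label = perturb_and_label("$185,000 Urbana, IL 3bed/2bath")
print(perturbed, label)
```

A perturbed example labeled "correct" teaches the sensors that a new price format alone is not evidence of a broken mapping.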
Additional Improvements
Improve training by borrowing data from other sources
[Figure: two sources S and S’, each with a wrapper and a source schema (attributes shown: cost, description, comments, amount, category), are mapped to the mediated schema (price); sample values: 185,000 USD and “This…”.]
Reduce false alarms via filtering
[Figure: a potentially corrupt attribute (price, value 185,000 USD) is checked against Web search engines (“price is 185,000 USD”, “costs 185,000 USD”), other sources (price 185,000 USD, amount 210 K), and monetary recognizers ($185,000); conclusion: price is valid.]
(see paper for details)
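As an illustration of the filtering idea, the sketch below uses a small regular-expression "monetary recognizer" to decide whether an alarm on a price attribute should be kept. The pattern, the function names, and the 90% cutoff are all hypothetical; the slide defers the actual filters to the paper.

```python
import re

# Hypothetical recognizer: accepts "$185,000", "185,000 USD", "USD 185000".
_NUM = r"\d[\d,]*(?:\.\d+)?"
MONEY_RE = re.compile(rf"^(?:\$\s?{_NUM}|{_NUM}\s?USD|USD\s?{_NUM})$")


def looks_monetary(value):
    """True if the value matches one of the assumed money formats."""
    return MONEY_RE.match(value.strip()) is not None


def keep_alarm(flagged_values, min_fraction=0.9):
    """Keep the alarm unless nearly all flagged values still look like money."""
    if not flagged_values:
        return True  # nothing to vouch for the attribute, so stay cautious
    recognized = sum(looks_monetary(v) for v in flagged_values)
    return recognized / len(flagged_values) < min_fraction


print(keep_alarm(["185,000 USD", "$210,000"]))  # values still look like prices
print(keep_alarm(["Century 21", "RE/MAX"]))     # values no longer look like prices
```

The first call suppresses the alarm and the second keeps it, which matches the slide's outcome that a reformatted price is still valid.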
Empirical Evaluation
Test verification ability over 114 sources in 6 domains

Domain      | Number of Sources | Schema Size (Number of Attributes) | Probing Schedule        | Snapshots with Correct Mappings | Snapshots with Broken Mappings
Flights     | 19                | 8                                  | weekly for 10 weeks     | 164                             | 26
Books       | 21                | 6                                  | weekly for 12 weeks     | 210                             | 42
Researchers | 60                | 4                                  | daily for 313 days      |                                 |
Real Estate | 5                 | 17                                 | 11 snapshots per source | 30                              | 25
Inventory   | 4                 | 7                                  | 11 snapshots per source | 24                              | 20
Courses     | 5                 | 11                                 | 11 snapshots per source | 30                              | 25
Core MAVERIC Outperforms Prior Work
Achieve F-1 from 82-93%, an improvement of 4-19% in all domains
Compare with recent system [Lerman et al, Journal of AI Research 03]

Domain      | Lerman System P / R | Lerman System F-1 | Sensor Ensemble P / R | Sensor Ensemble F-1
Flights     | 0.81 / …            | …                 | … / …                 | …
Books       | 0.83 / …            | …                 | … / …                 | …
Researchers | 0.77 / …            | …                 | … / …                 | …
Real Estate | 0.45 / …            | …                 | … / …                 | …
Inventory   | 0.52 / …            | …                 | … / …                 | …
Courses     | 0.49 / …            | …                 | … / …                 | …
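For reference, P, R, and F-1 in the table follow the standard definitions. Treating a snapshot with a broken mapping as the positive class is an assumption about the evaluation setup: TP counts alarms raised on broken mappings, FP counts alarms raised on correct mappings, and FN counts broken mappings that raised no alarm.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
```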
Enhancements Boost Performance
Each enhancement improved F-1 in at least 4 domains
[Figure: F-1 of progressively enhanced versions of MAVERIC: Sensor Ensemble; Sensor Ensemble + Perturbation; Sensor Ensemble + Perturbation + Multi-Src Train; Sensor Ensemble + Perturbation + Multi-Src Train + Filtering.]
Reasons for Mistakes
Unrecognized instance formats
–E.g., trained over TIME with format 2:00 pm, source changed format to 1400, output false alarm
–E.g., trained over DAYS with format M-W-F, source changed format to Mon Wed Fri, output false alarm
–Train with additional perturbations? Leverage more sources?
Attributes with similar values
–E.g., trained with ORDER-DATE before SHIP-DATE, source reversed order, missed alarm on reversed values (ORDER-DATE = 7/13/2004, SHIP-DATE = 7/4/2004)
–Include additional domain constraints?
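The domain constraint suggested for the ORDER-DATE / SHIP-DATE case could look like the check below. This is a sketch, assuming both attributes have already been parsed into `date` values; the function name and row layout are hypothetical, and the slide only raises such constraints as a question.

```python
from datetime import date


def violates_order_before_ship(rows):
    """Return the rows whose order date falls after their ship date."""
    return [r for r in rows if r["ORDER-DATE"] > r["SHIP-DATE"]]


rows = [
    # Reversed values from the slide: 7/13/2004 ordered, 7/4/2004 shipped.
    {"ORDER-DATE": date(2004, 7, 13), "SHIP-DATE": date(2004, 7, 4)},
    # A plausible row with the two attributes in the right order.
    {"ORDER-DATE": date(2004, 7, 4), "SHIP-DATE": date(2004, 7, 13)},
]
bad = violates_order_before_ship(rows)
print(len(bad))
```

A constraint like this catches swapped attributes even when both columns hold well-formed dates and therefore look identical to format-based sensors.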
Related Work
Schema matching
–[Dhamankar et al, 04], [He & Chang, 03], [Kang & Naughton, 03], [Rahm & Bernstein, 01], [Doan, 01]
–Quantify semantics to compute matching scores
Activity monitoring
–[Shavlik & Shavlik, 04], [Lazarevic et al, 03], [Stolfo et al, 01], [Fawcett & Provost, 99], [Allan et al, 98]
–Profile normal behavior to detect notable events (e.g., intrusions)
Mapping and wrapper maintenance
–Wrapper verification: [Lerman et al, 03], [Kushmerick, 00]
–Mapping and wrapper repair: [Velegrakis et al, 03], [Meng et al, 03], [Chidlovskii, 01]
Conclusion & Future Work
Developed MAVERIC to reduce maintenance costs
–An ensemble of sensors that exploit multiple characteristics of data
Significantly improved core system
–Perturbation, multi-source training, and filtering
Extensively evaluated over 114 sources in 6 domains
–Core outperformed related work, improving F-1 by 4-19%
–Enhancements further improved F-1 by 2-13%
Future work
–Further improve and evaluate MAVERIC
–Develop a solution for repairing broken mappings