Target schema and domain evolution, Source metadata preparation, Source data preparation, Metadata matching, Target data instantiation, Transformation and analysis


1 Target schema and domain evolution, Source metadata preparation, Source data preparation, Metadata matching, Target data instantiation, Transformation and analysis

Hypothesis formation
- Determine the likely arson suspect
- limited transportation
- fondness for patterns
- “One person living at the center of a geographic pattern of historical and current incidents”

Guess a task-specific schema
{Name of person or incident: String,
 Incident cause: String,
 Location {street: string ∈ P.O. street list,
           city: string ∈ city list,
           zip: integer ∈ P.O. list in area of interest,
           lat, long: float ∈ area of interest}}

Find sources to fill in target; Familiarization; Clarify semantics and domains
- Current events (xml)
- Historical events (xls)
- Historical events (html)
- People (xls)

Data assessment and profiling; Find or build extension functions
- spelling errors in cause field
- “Twinford Drive” inconsistent with geo data
- Geo data and park names switched on two other fires

Extension functions ready:
- split street, city, state, zip
- street name to lat/long
- zip code to lat/long
- bin causes as “suspicious” or null
- CSV → KML for map upload

Map source schemas to target; Fill in target relation, learning by example; Remove extraneous data by projection; De-duplicate entities & attributes; Visualization; Mapping of “inexpressible” data; Theory formulation; Verification; Identify missing pieces

- T.name ← People_info.name | historical_events.id | current_events.fire.dtg
- T.cause ← bin(historical_events.cause)
- T.street, .city, .zip ← split(People_info.address)
- T.lat, .long ← getLatLong(split(People_info.address)) | current_events.fire.latitude, .longitude | historical_events.Lat, .Lng

- Create target instance using CHIME (acceleration via learning by example)
- Select people and suspicious events
- Project down to {name, lat, long}
- Resolve duplicate entities with CHIME
- Convert to CSV, then KML
- Load to Google Maps
- Set icon colors for visibility
- Make judgement about “pattern”
- An answer: Jimmy West
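The extension functions named on slide 1 (splitting an address, binning causes, converting CSV to KML) could look roughly like the Python sketch below. The slides do not show the actual implementations, so the function names, the "street, city, ST zip" address format, and the keyword list are all assumptions made for illustration. Geocoding (street name or zip code to lat/long) is left out because it depends on an external data source.

```python
import csv
import io
import re
from xml.sax.saxutils import escape

# Hypothetical keyword list; the slides do not say how causes are binned.
SUSPICIOUS_KEYWORDS = ("arson", "suspicious", "incendiary")

# Assumed address layout: "street, city, ST 12345".
ADDRESS_RE = re.compile(
    r"\s*(?P<street>[^,]+),\s*(?P<city>[^,]+),"
    r"\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})\s*"
)


def split_address(address):
    """Split an address string into street, city, state, and zip."""
    match = ADDRESS_RE.fullmatch(address)
    if match is None:
        return None  # leave unparseable addresses for human review
    parts = {key: value.strip() for key, value in match.groupdict().items()}
    parts["zip"] = int(parts["zip"])  # the target schema declares zip as integer
    return parts


def bin_cause(cause):
    """Return 'suspicious' if the cause text contains a keyword, else None."""
    text = (cause or "").lower()
    if any(keyword in text for keyword in SUSPICIOUS_KEYWORDS):
        return "suspicious"
    return None


def csv_to_kml(csv_text):
    """Convert CSV rows with name, lat, long columns to a minimal KML document."""
    placemarks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # KML coordinates are written longitude first, then latitude.
        placemarks.append(
            "<Placemark><name>{}</name><Point><coordinates>{},{}"
            "</coordinates></Point></Placemark>".format(
                escape(row["name"]), float(row["long"]), float(row["lat"])
            )
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n'
        + "\n".join(placemarks)
        + "\n</Document></kml>"
    )
```

Returning None for unparseable addresses and unmatched causes keeps the uncertain cases visible for a person to resolve, in line with the deck's emphasis on human judgment. Misspelled causes, which slide 1 reports in the source data, would slip past this simple keyword match.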

2 Arson Suspect: Target Schema and Solution Map

3 Rescue Order: Target Schema and Solution List
John & Joan, then Jenny, then Jack

4 Target schema and domain evolution, Source metadata preparation, Source data preparation, Metadata matching, Target data instantiation, Transformation and analysis

A fly (or many?) in the ointment…

Hypothesis formation; Guess a task-specific schema
- We don’t know how to compute or verify a task-specific schema automatically

Find sources to fill in target; Familiarization; Clarify semantics and domains; Data assessment and profiling; Find or build extension functions; Map source schemas to target; Fill in target relation, learning by example; Remove extraneous data by projection; De-duplicate entities & attributes; Visualization; Mapping of “inexpressible” data; Theory formulation; Verification; Identify missing pieces

- Matching source to target requires semantic knowledge held only by humans
- Partial attribute values and unstructured data lack semantics, which must come from human knowledge (but the copy and paste action required is learnable by example!)
- Entity & attribute resolution requires human-guided choices, e.g. “John and Joan Smith resolve to just one household, but which purchase year is right?”
- Some things, like geometric and spatial recognition, require human interpretation
- “What’s missing, where do I find it?” is not computable by a machine
- No good language for semantics
- Don’t know how to compute “right” domains from source metadata
- Data in diverse formats is too complex for rapid human review
- Need to determine keys and check FDs, then clean data
- No language to describe semantics of inputs and outputs of extension functions, so selection of functions cannot be automatic
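Slide 4 mentions the need to determine keys and check functional dependencies (FDs) before cleaning data. A minimal Python sketch of those two checks over rows held as dictionaries follows; the function names and the row layout are assumptions for illustration, not something the deck specifies.

```python
def fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value.

    An empty result means the FD determinant -> dependent holds in `rows`.
    """
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in determinant)
        value = tuple(row[column] for column in dependent)
        seen.setdefault(key, set()).add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


def is_candidate_key(rows, columns):
    """Return True if no two rows share the same values for `columns`."""
    keys = [tuple(row[column] for column in columns) for row in rows]
    return len(keys) == len(set(keys))
```

A violation of an FD such as zip → city in the sample data only shows that something is inconsistent; deciding which value is correct is still left to a person.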

5 Our approach:
– Assist users in
  - data familiarization
  - data assessment and profiling
  - mapping
  - entity/attribute resolution
– Let human judgment make the call
– Accelerate human effort via “learn-by-example”

Our integration research projects
– Quarry
– Infosonde
– CHIME

6 CHIME is… an information integration application to capture evolving human knowledge about task-specific data
- Evolving task-specific schema and entity sets
- Mapping diverse data to correct attributes and entities
- Learning by example to speed integration where possible
- Resolving entities and attributes, and recording user choices
- Navigating and revising the history of integration decisions made in a dataset
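To make "resolving entities and attributes, and recording user choices" concrete, here is a small Python sketch of attribute resolution with a decision history. It is not CHIME's actual design or API, which the slides do not describe; the class names, the `choose` callback standing in for the human, and the record layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """One recorded choice between conflicting attribute values."""
    entity: str
    attribute: str
    candidates: list
    chosen: object


@dataclass
class Resolver:
    """Merges duplicate records and keeps a history of the choices made."""
    history: list = field(default_factory=list)

    def resolve(self, entity, records, choose):
        """Merge `records` for one entity, calling `choose` on each conflict."""
        merged = {}
        attributes = sorted({name for record in records for name in record})
        for attribute in attributes:
            candidates = []
            for record in records:
                if attribute in record and record[attribute] not in candidates:
                    candidates.append(record[attribute])
            if len(candidates) == 1:
                merged[attribute] = candidates[0]
            else:
                # Conflicting values: defer to the human-supplied callback
                # and record the decision so it can be reviewed or revised.
                chosen = choose(attribute, candidates)
                self.history.append(Decision(entity, attribute, candidates, chosen))
                merged[attribute] = chosen
        return merged
```

For the "John and Joan Smith" case on slide 4, two records for the same household with different purchase years would produce one merged record plus one history entry listing both candidate years and the one the user picked.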

7 CHIME is… part of an architecture for capturing and sharing semantics, annotations, and usage of sub-document data

[Architecture diagram]
Components: CHIME, Information Integration Application, Repository, Pub/Sub, UI, Mark API, Doc API, WebUI, Editing/Markup Application, Feed Browser, Ontology Inference and Authoring
Function labels: Mark semantics review, Schema creation, Entity resolution, Attribute resolution, Literal mark creation, Add to mark semantics, Copy/Paste, Mark “headline” browsing, Mark visit initiation, Feed query creation, View marks in context, Mark history review, Mark creation, Mark deployment to docs, Context gathering, Annotation, Infer over mark semantics, Mark browsing, Mark searching, Document submission, Share mark references
Data flows: Populated schemas, New and updated documents, Ontologies and thesauri

8 Metrics
- Scale-up improvement
- Scale-out improvement
- % of target schema successfully integrated
- % of identified user tasks automated or assisted
- % of data discrepancies detected, corrected automatically
- “cold-start” to “warm-start” time-to-solution ratio
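The percentage and ratio metrics on slide 8 reduce to simple arithmetic. The Python sketch below shows one plausible reading; the slides do not define the metrics precisely, so the function names and the choice of cold-start time divided by warm-start time are assumptions.

```python
def percent(part, whole):
    """Return part/whole as a percentage, e.g. attributes integrated of all target attributes."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def time_to_solution_ratio(cold_start_seconds, warm_start_seconds):
    """Return cold-start time divided by warm-start time (assumed definition)."""
    if warm_start_seconds <= 0:
        raise ValueError("warm_start_seconds must be positive")
    return cold_start_seconds / warm_start_seconds
```

Under this reading, a ratio above 1 would mean that a warm start (for example, with previously learned examples available) reached a solution faster than a cold start.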

