12
Goal
– Support "green field" profiling by providing a browsable, uniform representation of all data
Strategy
– Shred any source automatically using heuristics and conventions
– Output is a set of triples (near-RDF)
– Automatically ingest, cluster, and index these triples
– Worry about coverage/completeness, but not cleanliness or semantics (yet)
13
Shred "tabular" XML as Examples: Shred text records as Examples:
14
What can we do with the shredded data?
– Browse online: interactive performance required
– Query: simple API for rapid app development, not full SPARQL
– Customize: add, modify, clean the metadata
– Do stuff: interact with the outside world with Behaviors
15
Customization
– Add a property to a set of resources
– Collapse a path (not yet available)
16
Behaviors
Browse to the set of resources you're interested in, then do something with them:
– "Draw a map of all those resources that have a latitude and longitude"
– "For each resource that has an email address, send a message"
– "Draw a plot of all resources that have an 'x' property and a 'y' property"
– "Export the current set of resources to Excel/MS Access/Postgresql"
In general (sketched below):
– Call f(r) for each resource r in the working set that expresses properties p0, p1, ...
– Call g(R) on the set of resources R in the working set that express properties p0, p1, ...
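A minimal sketch of this pattern, with resources modeled as plain dicts (the function names are mine, not Quarry's):

```python
# Apply f to every resource in the working set that expresses all required
# properties, or g to the whole qualifying subset.
def for_each(working_set, required_props, f):
    """Call f(r) for each resource r expressing all of required_props."""
    for r in working_set:
        if all(p in r for p in required_props):
            f(r)

def on_subset(working_set, required_props, g):
    """Call g(R) on the subset R expressing all of required_props."""
    g([r for r in working_set if all(p in r for p in required_props)])

# The "map" behavior then needs only latitude and longitude:
events = [{"latitude": 45.5, "longitude": -122.6}, {"email": "a@b.org"}]
for_each(events, ["latitude", "longitude"],
         lambda r: print("plot at", r["latitude"], r["longitude"]))
```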
17
Back to the “source” property
18
Map behavior available for items with “latitude” and “longitude”
19
Back to the “source” property
20
How can we map historical events? Add properties: latitude = $Lat, longitude = $Lng
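A sketch of how such an "add property" customization might work; reading latitude = $Lat as "copy each resource's Lat value into a new latitude property" is my assumption about the $-syntax:

```python
# "latitude = $Lat" interpreted as: give every resource that has a Lat
# property a latitude property with the same value.
def add_property(working_set, new_prop, source_prop):
    for r in working_set:
        if source_prop in r:
            r[new_prop] = r[source_prop]

historical = [{"Lat": 45.51, "Lng": -122.68, "cause": "unknown"}]
add_property(historical, "latitude", "Lat")
add_property(historical, "longitude", "Lng")
# These resources are now compatible with the map behavior.
```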
22
Now more resources are compatible with the map behavior
23
Other customizations allow configuring color and symbol on the map for each resource
24
(Map annotations: "current" vs. "historical" events; shade of red indicates 30-day temperature (a red herring); "arsonist?"; "Better background layers exist!")
26
Profiling with InfoSonde
– User proposes candidate property, such as FD or partitioning predicates
– System tests hypothesis
  – Confirmed: offers appropriate structural enhancement
  – Disconfirmed: displays violations, offers cleaning options
27
Example: FD-based Normalization
– current_events.xml first converted to relation (eventType, DTG, latitude, longitude, description)
– eventType appears to use a controlled vocabulary
– InfoSonde verifies DTG → eventType
– However, DTG is also a key
– Instead of normalization, InfoSonde offers to create an index (a sketch of the check follows)
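A sketch of the checks this example relies on (not InfoSonde's actual code): the FD DTG → eventType holds iff each DTG value maps to a single eventType, and DTG is a key iff its values never repeat:

```python
def fd_holds(rows, lhs, rhs):
    """Return (holds, violations) for the FD lhs -> rhs over a list of dicts."""
    seen, violations = {}, []
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            violations.append(row)    # shown to the user on disconfirmation
        seen.setdefault(key, row[rhs])
    return (not violations, violations)

def is_key(rows, col):
    """True iff no value of col repeats."""
    values = [row[col] for row in rows]
    return len(values) == len(set(values))

rows = [
    {"DTG": "2008-07-01T10:00", "eventType": "Fire"},
    {"DTG": "2008-07-01T11:30", "eventType": "Flood"},
]
print(fd_holds(rows, "DTG", "eventType"))  # (True, [])
print(is_key(rows, "DTG"))                 # True: index, don't normalize
```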
28
Enhancing Current Events Data
– Candidate property: fire descriptions include the place of origin, size, and damage estimate in a delimited string
  eventType in ('Fire', 'WildFire') ↔ description like 'Fire started at *;* Acres;* Destroyed'
– Enhancement: promote substructure to schema (sketched below)
– Expected result: geocoding place names allows mapping origins along with current positions
– Unexpected result: apparent misalignment of provided descriptions for two most recent fires
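A sketch of the promotion step, assuming the * wildcards in the pattern match a place name, an acreage, and a damage count; the sample description is invented:

```python
# Promote substructure to schema: parse the delimited fire description into
# origin, acres, and destroyed fields matching the pattern
# 'Fire started at *;* Acres;* Destroyed'.
import re

PATTERN = re.compile(
    r"Fire started at (?P<origin>[^;]+);\s*(?P<acres>\d+) Acres;"
    r"\s*(?P<destroyed>\d+) Destroyed")

def promote(description):
    m = PATTERN.match(description)
    return m.groupdict() if m else None   # None flags a misaligned description

print(promote("Fire started at Forest Park; 120 Acres; 3 Destroyed"))
# {'origin': 'Forest Park', 'acres': '120', 'destroyed': '3'}
```

A None result on a description that should match is exactly the kind of misalignment the slide reports for the two most recent fires.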
29
(Map: current position vs. reported origin for each fire.)
30
Workflow stages: target schema and domain evolution; source metadata preparation; source data preparation; metadata matching; target data instantiation; transformation and analysis; hypothesis formation.

Hypothesis formation
– Task: determine the likely arson suspect (limited transportation, fondness for patterns)
– Hypothesis: "One person living at the center of a geographic pattern of historical and current incidents"

Guess a task-specific schema
{ Name of person or incident: String,
  Incident cause: String,
  Location { street: string (P.O. street list),
             city: string (city list),
             zip: integer (P.O. list in area of interest),
             lat, long: float (in area of interest) } }

Find sources to fill in target; familiarization; clarify semantics and domains
– Current events (xml), Historical events (xls), Historical events (html), People (xls)

Data assessment and profiling; find or build extension functions
– Issues found: spelling errors in cause field; "Twinford Drive" inconsistent with geo data; geo data and park names switched on two other fires
– Extension functions ready: split street, city, state, zip; street name to lat/long; zip code to lat/long; bin causes as "suspicious" or null; CSV to KML for map upload

Map source schemas to target (sketched below)
– T.name ← People_info.name | historical_events.id | current_events.fire.dtg
– T.cause ← bin(historical_events.cause)
– T.street, .city, .zip ← split(People_info.address)
– T.lat, .long ← getLatLong(split(People_info.address)) | current_events.fire.latitude, .longitude | historical_events.Lat, .Lng

Target data instantiation; transformation and analysis
– Create target instance using CHIME (acceleration via learning by example)
– Select people and suspicious events; remove extraneous data by projecting down to {name, lat, long}
– Resolve duplicate entities and attributes with CHIME
– Convert to CSV, then KML; load to Google Maps; set icon colors for visibility

Visualization; mapping of "inexpressible" data; theory formulation; verification
– Make a judgement about the "pattern"
– An answer: Jimmy West
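The mapping expressions above, rendered as a sketch; bin, split, and getLatLong stand in for the slide's extension functions, and the source record shapes are assumptions based on the field names shown:

```python
# Hypothetical target-instantiation helpers; the SUSPICIOUS vocabulary and
# the exact address format are assumptions for illustration.
SUSPICIOUS = {"arson", "unknown", "suspicious"}

def bin_cause(cause):
    """Bin causes as 'suspicious' or None, per the slide's extension function."""
    return "suspicious" if cause and cause.lower() in SUSPICIOUS else None

def split_address(address):
    """Split 'street, city, zip' into parts (a simplification)."""
    street, city, zip_code = [p.strip() for p in address.split(",")]
    return street, city, int(zip_code)

def person_to_target(p, get_lat_long):
    """T.name <- People_info.name; T.lat,.long <- getLatLong(split(address))."""
    street, city, zip_code = split_address(p["address"])
    lat, long_ = get_lat_long(street, city, zip_code)   # hypothetical geocoder
    return {"name": p["name"], "cause": None,
            "street": street, "city": city, "zip": zip_code,
            "lat": lat, "long": long_}

def historical_to_target(e):
    """T.name <- historical_events.id; T.cause <- bin(cause); lat/long direct."""
    return {"name": e["id"], "cause": bin_cause(e["cause"]),
            "street": None, "city": None, "zip": None,
            "lat": e["Lat"], "long": e["Lng"]}
```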
31
Arson Suspect: Target Schema and Solution Map
32
Rescue Order: Target Schema and Solution List
John & Joan, then Jenny, then Jack
33
A fly (or many?) in the ointment… limitations at each workflow stage:

Guess a task-specific schema
– We don't know how to compute or verify a task-specific schema automatically

Find sources to fill in target; identify missing pieces
– "What's missing, where do I find it?" is not computable by a machine
– No good language for semantics
– Don't know how to compute "right" domains from source metadata

Familiarization; clarify semantics and domains
– Data in diverse formats is too complex for rapid human review

Data assessment and profiling
– Need to determine keys and check FDs, then clean data

Find or build extension functions
– No language to describe the semantics of extension-function inputs and outputs, so selection of functions cannot be automatic

Map source schemas to target
– Matching source to target requires semantic knowledge held only by humans

Fill in target relation, learning by example
– Partial attribute values and unstructured data lack semantics, which must come from human knowledge (but the copy-and-paste action required is learnable by example!)

De-duplicate entities & attributes
– Entity and attribute resolution requires human-guided choices, e.g. "John and Joan Smith resolve to just one household, but which purchase year is right?"

Visualization; theory formulation; verification
– Some things, like geometric and spatial recognition, require human interpretation
34
Our approach:
– Assist users in data familiarization, data assessment and profiling, mapping, and entity/attribute resolution
– Let human judgment make the call
– Accelerate human effort via "learn-by-example"
Our integration research projects:
– Quarry
– InfoSonde
– CHIME
35
CHIME is… an information integration application to capture evolving human knowledge about task-specific data:
– Evolving task-specific schema and entity sets
– Mapping diverse data to correct attributes and entities
– Learning by example to speed integration where possible
– Resolving entities and attributes, and recording user choices
– Navigating and revising the history of integration decisions made in a dataset
36
CHIME is… part of an architecture for capturing and sharing semantics, annotations, and usage of sub-document data.

(Architecture diagram, flattened here.) A shared Repository, exposing a Doc API, a Mark API, and Pub/Sub, and holding populated schemas, new and updated documents, and ontologies and thesauri, connects three applications: an Editing/Markup Application (WebUI), a Feed Browser, and an Ontology Inference and Authoring tool. Operations in the diagram include: mark creation, literal mark creation, copy/paste, mark deployment to docs, context gathering, annotation, schema creation, entity resolution, attribute resolution, mark semantics review, add to mark semantics, mark "headline" browsing, mark visit initiation, feed query creation, view marks in context, mark history review, mark browsing, mark searching, infer over mark semantics, share mark references, and document submission.
37
Metrics
– Scale-up improvement
– Scale-out improvement
– % of target schema successfully integrated
– % of identified user tasks automated or assisted
– % of data discrepancies detected, corrected automatically
– "cold-start" to "warm-start" time-to-solution ratio
38
Quarry Data Model
– resource, property, value: (subject, predicate, object) if you prefer
– no intrinsic distinction between literal values and resource values
– no explicit types or classes
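A minimal rendering of the model: bare triples, no schema. The first two rows are taken from the RxNorm example on the next slide; the last row is hypothetical:

```python
# Bare (resource, property, value) triples, with no type system and no
# distinction between literal and resource values: '10004' below is just a
# value that happens to also appear as a resource.
triples = [
    ("10001", "type", "DC"),
    ("10001", "ingredient_of", "10004"),
    ("10004", "type", "BN"),          # hypothetical row, for illustration
]
```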
39
Example: RxNorm
Concept, Atom, Relationship: up to 23M triples describing 0.6M concepts and atoms

key   | prop          | value
10001 | NDC           | 1
10001 | ORIG_CODE     | 123
10001 | ingredient_of | 10004
10001 | type          | DC
40
Example: Metadata for Scientific Data Repository
7.5M triples describing 1M files

path                     | prop     | value
…/anim-sal_estuary_7.gif | region   | estuary
…/anim-sal_estuary_7.gif | variable | salt
…/anim-sal_estuary_7.gif | depth    | 7
…/anim-sal_estuary_7.gif | type     | anim

Rendered for …/anim-sal_estuary_7.gif: Region = "Estuary", Variable = "Salinity", Type = "Animation", Depth = "7"
(Slides 41–48: marked SKIP in the talk; screenshots with no recoverable text.)
49
Quarry API
– Describe(key): all properties and values of one resource, e.g.
  Describe(…/2004/2004-001/…/anim-tem_estuary_bottom.gif) returns aggregate = bottom, animation = isotem, day = 001, directory = images, plottype = isotem, region = estuary, runid = 2004-001, year = 2004
  Describe(…/2004/2004-001/…/amp_plume_2d.gif) returns day = 001, directory = images, plottype = 2d, region = plume, runid = 2004-001, year = 2004
– Properties(runid=2004-001): the properties expressed by resources matching runid = 2004-001
– Values(runid=2004-001, "plottype"): the values of plottype among those resources
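A minimal in-memory sketch of the three calls, assuming triples are (key, prop, value) tuples; the real system evaluates them over signature extents, as the "Behind the Scenes" slide explains:

```python
def describe(triples, key):
    """All (property, value) pairs for one resource key."""
    return {(p, v) for k, p, v in triples if k == key}

def properties(triples, prop, value):
    """Properties expressed by resources matching prop = value."""
    keys = {k for k, p, v in triples if p == prop and v == value}
    return {p for k, p, v in triples if k in keys}

def values(triples, prop, value, target_prop):
    """Values of target_prop among resources matching prop = value."""
    keys = {k for k, p, v in triples if p == prop and v == value}
    return {v for k, p, v in triples if k in keys and p == target_prop}

t = [("f1.gif", "runid", "2004-001"), ("f1.gif", "plottype", "2d"),
     ("f2.gif", "runid", "2004-001"), ("f2.gif", "plottype", "isotem")]
print(values(t, "runid", "2004-001", "plottype"))  # {'2d', 'isotem'}
```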
50
API Clients
Applications use sequences of Prop and Val calls to explore the Dataspace.
(Screenshot of a faceted browser built on these calls; visible facets include runid, year, week, region and values such as plume, far, surface, estuary, 2003, 2004, 2005, plus a "show products…" action.)
51
Behind the Scenes
Signatures
– resources possessing the same properties are clustered together
– posit that |Signatures| << |Resources|
– queries are evaluated over signature extents
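A sketch of the clustering idea: group resources by the exact set of properties they express, and keep the extent of each signature:

```python
from collections import defaultdict

def signatures(triples):
    """Map each signature (frozen property set) to its extent (resources)."""
    props = defaultdict(set)                 # resource -> set of properties
    for resource, prop, _ in triples:
        props[resource].add(prop)
    extents = defaultdict(list)              # signature -> extent
    for resource, pset in props.items():
        extents[frozenset(pset)].append(resource)
    return extents

t = [("a", "x", 1), ("a", "y", 2), ("b", "x", 3), ("b", "y", 4), ("c", "x", 5)]
for sig, extent in signatures(t).items():
    print(sorted(sig), extent)   # ['x', 'y'] ['a', 'b'] then ['x'] ['c']
```

A multi-term query then needs to touch only the signatures whose property sets contain all of the query's properties, which pays off when |Signatures| << |Resources|.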
52
Experimental Results
– Yet another RDF store: several B-tree indexes to support spo, pos, osp, etc.
– ~3M triples
– We looked at multi-term queries: find the resources ?s that match several property/value terms at once
53
Experimental Results: Queries
3.6M triples, 606k resources, 149 signatures (results chart not recoverable)
54
Hands-off Operation
Feed it triples:
– calculates signatures
– computes signature extents
Working on an incremental facility for insertions (sketched below):
– a resource can change signatures
– new signatures can be created
The API doesn't name tables.
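A sketch of the incremental case described above, in which inserting a triple can move a resource between signature extents or create a new signature:

```python
def insert(props, extents, resource, prop):
    """Record that resource expresses prop, updating signature extents."""
    old = frozenset(props.get(resource, set()))
    props.setdefault(resource, set()).add(prop)
    new = frozenset(props[resource])
    if new != old:
        if resource in extents.get(old, []):
            extents[old].remove(resource)              # leave old extent
        extents.setdefault(new, []).append(resource)   # may create a signature

props, extents = {}, {}
insert(props, extents, "r1", "latitude")
insert(props, extents, "r1", "longitude")
print({tuple(sorted(s)): e for s, e in extents.items() if e})
# {('latitude', 'longitude'): ['r1']}
```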