Presentation transcript: Goal: Support “green field” profiling by providing a browsable, uniform representation of all data. Strategy: Shred any source automatically.


12 Goal: Support “green field” profiling by providing a browsable, uniform representation of all data
Strategy:
–Shred any source automatically using heuristics and conventions
–Output is a set of triples (near-RDF)
–Automatically ingest, cluster, and index these triples
–Worry about coverage/completeness, but not cleanliness or semantics (yet)
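
Below is a minimal Python sketch of what this shredding step might look like for a simple delimited text record; the shred_record function, the field names, and the record format are illustrative assumptions, not the actual shredder, which relies on per-source heuristics and conventions.

    # Illustrative sketch only: shred a "key=value" delimited text record into
    # (resource, property, value) triples. The real shredder applies heuristics
    # and source-specific conventions rather than a fixed record format.
    def shred_record(record_id, record, delimiter=";"):
        """Turn 'key=value' fragments of a text record into triples keyed by record_id."""
        triples = []
        for fragment in record.split(delimiter):
            if "=" in fragment:
                prop, value = (part.strip() for part in fragment.split("=", 1))
                triples.append((record_id, prop, value))
        return triples

    # Example: one event record becomes four triples about the resource "event-001".
    print(shred_record("event-001",
                       "eventType=Fire; latitude=45.52; longitude=-122.68; DTG=2008-06-01T12:00Z"))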

13 Shred "tabular" XML (examples on slide); shred text records (examples on slide)

14 What can we do with the shredded data?
Browse Online
–Interactive performance required
Query
–Simple API for rapid app development, not full SPARQL
Customize
–Add, modify, clean the metadata
Do Stuff
–Interact with the outside world with Behaviors

15 Customization
–Add a property to a set of resources
–Collapse a path (not yet available)

16 Behaviors
Browse to the set of resources you're interested in, then do something with them
–"Draw a map of all those resources that have a latitude and longitude"
–"For each resource that has an email address, send a message"
–"Draw a plot of all resources that have an 'x' property and a 'y' property"
–"Export the current set of resources to Excel/MS Access/PostgreSQL"
In general:
–Call f(r) for each resource r in the working set that expresses properties p0, p1, ...
–Call g(R) on the set of resources R in the working set that express properties p0, p1, ...
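
The "in general" pattern above can be made concrete with a small Python sketch; the function names, the dict-per-resource representation, and the example working set are assumptions made for illustration.

    # Illustrative sketch of behavior dispatch: apply f(r) per resource, or g(R)
    # to the whole qualifying subset, restricted to resources that express the
    # required properties.
    def apply_per_resource(working_set, required_props, f):
        """Call f(r) for each resource r expressing all required properties."""
        for r in working_set:
            if all(p in r for p in required_props):
                f(r)

    def apply_to_set(working_set, required_props, g):
        """Call g(R) on the subset R expressing all required properties."""
        R = [r for r in working_set if all(p in r for p in required_props)]
        g(R)

    # Example: the map behavior only fires for resources with latitude and longitude.
    working_set = [
        {"eventType": "Fire", "latitude": "45.52", "longitude": "-122.68"},
        {"eventType": "Flood"},  # no coordinates, so the map behavior skips it
    ]
    apply_per_resource(working_set, ["latitude", "longitude"],
                       lambda r: print("plot point at", r["latitude"], r["longitude"]))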

17 Back to the “source” property

18 Map behavior available for items with “latitude” and “longitude”

19 Back to the “source” property

20 How can we map historical events?
Add properties:
–latitude = $Lat
–longitude = $Lng
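
In triple terms, this customization amounts to copying Lat/Lng values into latitude/longitude properties so the map behavior recognizes the resources; the sketch below assumes the triple layout used earlier, and the function name is hypothetical.

    # Illustrative sketch: derive latitude/longitude properties from existing
    # Lat/Lng properties on each resource.
    def add_map_properties(triples):
        renames = {"Lat": "latitude", "Lng": "longitude"}
        derived = list(triples)
        for resource, prop, value in triples:
            if prop in renames:
                derived.append((resource, renames[prop], value))
        return derived

    # Example: a historical event exposing only Lat/Lng now also exposes latitude/longitude.
    historical = [("hist-17", "Lat", "45.50"), ("hist-17", "Lng", "-122.65")]
    print(add_map_properties(historical))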


22 Now more resources are compatible with the map behavior

23 Other customizations allow configuring color and symbol on the map for each resource

24 Map annotations: current vs. historical incidents; shade of red indicates 30-day temp (a red herring); arsonist?; better background layers exist!


26 Profiling with InfoSonde
User proposes a candidate property, such as an FD or a partitioning predicate
System tests the hypothesis
–Confirmed: offers an appropriate structural enhancement
–Disconfirmed: displays violations, offers cleaning options

27 Example: FD-based Normalization
–current_events.xml is first converted to the relation (eventType, DTG, latitude, longitude, description)
–eventType appears to use a controlled vocabulary
–InfoSonde verifies DTG → eventType
–However, DTG is also a key
–Instead of normalization, InfoSonde offers to create an index
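
A rough Python sketch of the checks involved appears below; it assumes the rows of the (eventType, DTG, latitude, longitude, description) relation are available as dicts, and the function names are hypothetical rather than InfoSonde's actual API.

    # Illustrative sketch: verify a functional dependency and a key.
    def holds_fd(rows, lhs, rhs):
        """Return violations of lhs -> rhs; an empty list confirms the FD."""
        seen, violations = {}, []
        for row in rows:
            key, val = row[lhs], row[rhs]
            if key in seen and seen[key] != val:
                violations.append((key, seen[key], val))
            seen.setdefault(key, val)
        return violations

    def is_key(rows, attr):
        """attr is a key if its values are all distinct."""
        values = [row[attr] for row in rows]
        return len(values) == len(set(values))

    rows = [
        {"eventType": "Fire",  "DTG": "2008-06-01T12:00Z"},
        {"eventType": "Flood", "DTG": "2008-06-02T08:30Z"},
    ]
    print(holds_fd(rows, "DTG", "eventType"))  # [] confirms DTG -> eventType
    print(is_key(rows, "DTG"))                 # True, so an index beats decomposition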

28 Enhancing Current Events Data
Candidate property: fire descriptions include the place of origin, size, and damage estimate in a delimited string
–eventType in (‘Fire’, ‘WildFire’) ↔ description like ‘Fire started at *;* Acres;* Destroyed’
Enhancement: promote substructure to schema
Expected result: geocoding place names allows mapping origins along with current positions
Unexpected result: apparent misalignment of provided descriptions for the two most recent fires
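
The "promote substructure to schema" step can be sketched with a regular expression over the quoted pattern; the group names, the sample description, and the place name below are hypothetical illustrations, not values from the dataset.

    # Illustrative sketch: parse the delimited fire description
    # 'Fire started at <place>;<n> Acres;<m> Destroyed' into new fields.
    import re

    FIRE_DESC = re.compile(
        r"Fire started at (?P<origin_place>[^;]+);\s*(?P<acres>[\d.]+) Acres;\s*(?P<destroyed>\d+) Destroyed")

    def promote_description(row):
        """Add origin_place, acres, and destroyed fields parsed from description."""
        if row.get("eventType") in ("Fire", "WildFire"):
            m = FIRE_DESC.match(row.get("description", ""))
            if m:
                row.update(m.groupdict())
        return row

    print(promote_description({
        "eventType": "Fire",
        "description": "Fire started at Forest Park; 120 Acres; 3 Destroyed",
    }))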

29 Map: current position vs. reported origin

30 Arson-suspect walkthrough, by workflow stage:
Hypothesis formation
–Determine the likely arson suspect (limited transportation, fondness for patterns): “One person living at the center of a geographic pattern of historical and current incidents”
Target schema and domain evolution
–Guess a task-specific schema: {Name of person or incident: String, Incident cause: String, Location {street: string → P.O. street list, city: string → city list, zip: integer → P.O. list in area of interest, lat, long: float → in area of interest}}
–Find sources to fill in the target
Source metadata preparation
–Familiarization; clarify semantics and domains
–Sources: Current events (xml), Historical events (xls), Historical events (html), People (xls)
Source data preparation
–Data assessment and profiling: spelling errors in the cause field; “Twinford Drive” inconsistent with geo data; geo data and park names switched on two other fires
–Extension functions ready: split street, city, state, zip; street name to lat/long; zip code to lat/long; bin causes as “suspicious” or null; CSV → KML for map upload
Metadata matching (map source schemas to target)
–T.name ← People_info.name | historical_events.id | current_events.fire.dtg
–T.cause ← bin(historical_events.cause)
–T.street, .city, .zip ← split(People_info.address)
–T.lat, .long ← getLatLong(split(People_info.address)) | current_events.fire.latitude, .longitude | historical_events.Lat, .Lng
Target data instantiation
–Fill in the target relation, learning by example: create the target instance using CHIME (acceleration via learning by example)
–Remove extraneous data by projection: select people and suspicious events; project down to {name, lat, long}
–De-duplicate entities & attributes: resolve duplicate entities with CHIME
Transformation and analysis
–Visualization, mapping of “inexpressible” data: convert to CSV, then KML; load to Google Maps; set icon colors for visibility
–Theory formulation, verification, identify missing pieces: make a judgement about the “pattern”
–An answer: Jimmy West
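
As an example of the extension functions listed above, a minimal "split street, city, state, zip" helper might look like the sketch below; the address format and the sample city, state, and zip are assumptions for illustration (only the "Twinford Drive" street name comes from the profiling notes), and a real function would validate against the postal lists.

    # Illustrative sketch: split a comma-delimited US-style address into parts.
    def split_address(address):
        street, city, state_zip = (part.strip() for part in address.split(",", 2))
        state, zip_code = state_zip.split()
        return {"street": street, "city": city, "state": state, "zip": zip_code}

    print(split_address("123 Twinford Drive, Portland, OR 97201"))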

31 Arson Suspect: Target Schema and Solution Map

32 Rescue Order: Target Schema and Solution List (John & Joan, then Jenny, then Jack)

33 A fly (or many?) in the ointment…
The same workflow stages and steps apply (target schema and domain evolution; source metadata preparation; source data preparation; metadata matching; target data instantiation; transformation and analysis; hypothesis formation), but each resists automation:
–Guess a task-specific schema: we don't know how to compute or verify a task-specific schema automatically
–Find sources to fill in the target; familiarization; clarify semantics and domains: no good language for semantics; we don't know how to compute the “right” domains from source metadata
–Data assessment and profiling: data in diverse formats is too complex for rapid human review; need to determine keys and check FDs, then clean data
–Find or build extension functions: no language describes the semantics of the inputs and outputs of extension functions, so selection of functions cannot be automatic
–Map source schemas to target: matching source to target requires semantic knowledge held only by humans
–Fill in the target relation, learning by example: partial attribute values and unstructured data lack semantics, which must come from human knowledge (but the copy-and-paste action required is learnable by example!)
–De-duplicate entities & attributes: entity & attribute resolution requires human-guided choices, e.g. “John and Joan Smith resolve to just one household, but which purchase year is right?”
–Visualization, mapping of “inexpressible” data, theory formulation, verification: some things, like geometric and spatial recognition, require human interpretation
–Identify missing pieces: “What's missing, where do I find it?” is not computable by a machine

34 Our approach:
–Assist users in data familiarization, data assessment and profiling, mapping, and entity/attribute resolution
–Let human judgment make the call
–Accelerate human effort via “learn-by-example”
Our integration research projects:
–Quarry
–InfoSonde
–CHIME

35 CHIME is… an information integration application to capture evolving human knowledge about task-specific data
–Evolving task-specific schema and entity sets
–Mapping diverse data to correct attributes and entities
–Learning by example to speed integration where possible
–Resolving entities and attributes, and recording user choices
–Navigating and revising the history of integration decisions made in a dataset

36 CHIME is… part of an architecture for capturing and sharing semantics, annotations, and usage of sub-document data
(Architecture diagram: an Information Integration Application, a Repository with Pub/Sub and Mark/Doc APIs, a WebUI, an Editing/Markup Application, a Feed Browser, and Ontology Inference and Authoring; activities span mark creation, browsing, searching, and semantics review, schema creation, entity and attribute resolution, annotation, context gathering, and document submission, over populated schemas, new and updated documents, and ontologies and thesauri)

37 Metrics
–Scale-up improvement
–Scale-out improvement
–% of target schema successfully integrated
–% of identified user tasks automated or assisted
–% of data discrepancies detected, corrected automatically
–“Cold-start” to “warm-start” time-to-solution ratio

38 Quarry Data Model
resource, property, value
–(subject, predicate, object) if you prefer
–no intrinsic distinction between literal values and resource values
–no explicit types or classes

39 Example: RxNorm
Concept, Atom, Relationship: up to 23M triples describing 0.6M concepts and atoms
userkey | prop | value
10001 | NDC | 1
10001 | ORIG_CODE | 123
10001 | ingredient_of | 10004
10001 | type | DC
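
Rendered as Quarry triples, the rows above also illustrate the "no intrinsic distinction between literal values and resource values" point from the previous slide: the object of ingredient_of happens to be another resource's key, while the object of type is a plain literal, and the store treats them the same way. The Python rendering below is only an illustration of that layout.

    # Illustrative sketch: the RxNorm rows as (key, prop, value) triples.
    triples = [
        ("10001", "NDC", "1"),
        ("10001", "ORIG_CODE", "123"),
        ("10001", "ingredient_of", "10004"),  # value is another resource's key
        ("10001", "type", "DC"),              # value is a plain literal
    ]

    # Properties expressed by resource 10001:
    print(sorted({prop for key, prop, value in triples if key == "10001"}))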

40 Example: Metadata for a Scientific Data Repository
7.5M triples describing 1M files
path | prop | value
…/anim-sal_estuary_7.gif | depth | 7
…/anim-sal_estuary_7.gif | variable | salt
…/anim-sal_estuary_7.gif | region | estuary
…/anim-sal_estuary_7.gif | type | anim
For …/anim-sal_estuary_7.gif this reads as: Region = “Estuary”, Variable = “Salinity”, Type = “Animation”, Depth = “7”

49 Quarry API
Example resources and their properties:
–…/2004/2004-001/…/anim-tem_estuary_bottom.gif: aggregate = bottom, animation = isotem, day = 001, directory = images, plottype = isotem, region = estuary, runid = 2004-001, year = 2004
–…/2004/2004-001/…/amp_plume_2d.gif: day = 001, directory = images, plottype = 2d, region = plume, runid = 2004-001, year = 2004
API calls: Describe(key), Values(runid=2004-001, “plottype”), Properties(runid=2004-001)
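
The sketch below gives one plausible reading of these three calls over an in-memory triple set: Describe(key) lists a resource's property/value pairs, while Properties and Values operate over the resources selected by a prop=value term. These semantics are inferred from the slide, and the Python functions are illustrations, not the actual Quarry API.

    # Illustrative sketch of Describe / Properties / Values over triples.
    triples = [
        (".../2004/2004-001/.../amp_plume_2d.gif", "plottype", "2d"),
        (".../2004/2004-001/.../amp_plume_2d.gif", "region", "plume"),
        (".../2004/2004-001/.../amp_plume_2d.gif", "runid", "2004-001"),
        (".../2004/2004-001/.../anim-tem_estuary_bottom.gif", "plottype", "isotem"),
        (".../2004/2004-001/.../anim-tem_estuary_bottom.gif", "region", "estuary"),
        (".../2004/2004-001/.../anim-tem_estuary_bottom.gif", "runid", "2004-001"),
    ]

    def describe(key):
        """Property/value pairs of one resource."""
        return {(p, v) for k, p, v in triples if k == key}

    def _selection(prop, value):
        """Resources matching a prop=value term."""
        return {k for k, p, v in triples if p == prop and v == value}

    def properties(prop, value):
        """Properties expressed by the selected resources."""
        keys = _selection(prop, value)
        return {p for k, p, v in triples if k in keys}

    def values(prop, value, target_prop):
        """Values of target_prop over the selected resources."""
        keys = _selection(prop, value)
        return {v for k, p, v in triples if k in keys and p == target_prop}

    print(describe(".../2004/2004-001/.../amp_plume_2d.gif"))  # cf. Describe(key)
    print(properties("runid", "2004-001"))                     # cf. Properties(runid=2004-001)
    print(values("runid", "2004-001", "plottype"))             # cf. Values(runid=2004-001, "plottype")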

50 API Clients
Applications use sequences of Prop and Val calls to explore the Dataspace
(Screenshots of a faceted browsing client: property lists such as runid, year, week, region; value lists such as region = plume | far | surface and year = 2003 | 2004 | 2005; a “show products…” action)

51 Behind the Scenes
Signatures
–resources possessing the same properties clustered together
–posit that |Signatures| << |Resources|
–queries evaluated over Signature Extents
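
Computing signatures amounts to grouping resources by the exact set of properties they express; the sketch below assumes the (key, prop, value) triple layout used in the earlier examples, and the function name is hypothetical.

    # Illustrative sketch: signature = a resource's property set;
    # extent = all resources sharing that signature.
    from collections import defaultdict

    def compute_signatures(triples):
        props_by_resource = defaultdict(set)
        for key, prop, _ in triples:
            props_by_resource[key].add(prop)
        extents = defaultdict(list)
        for key, props in props_by_resource.items():
            extents[frozenset(props)].append(key)
        return extents

    triples = [
        ("fileA", "region", "estuary"), ("fileA", "plottype", "2d"),
        ("fileB", "region", "plume"),   ("fileB", "plottype", "2d"),
        ("fileC", "region", "estuary"),
    ]
    # fileA and fileB share one signature, so three resources yield two signatures.
    for signature, extent in compute_signatures(triples).items():
        print(sorted(signature), "->", extent)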

52 Experimental Results
Yet Another RDF Store
–several B-Tree indexes to support spo, po → s, os → p, etc.
–~3M triples
We looked at multi-term queries binding ?s
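
The permuted-index idea can be sketched as keeping the same triples sorted in several key orders so each access pattern becomes a prefix scan; real stores use B-Trees, and the sorted lists with bisect below merely stand in for them.

    # Illustrative sketch: spo, pos, and osp orderings over the same triples.
    import bisect

    triples = [
        ("fileA", "region", "estuary"),
        ("fileA", "plottype", "2d"),
        ("fileB", "region", "plume"),
    ]
    spo = sorted((s, p, o) for s, p, o in triples)  # subject lookups
    pos = sorted((p, o, s) for s, p, o in triples)  # po -> s lookups
    osp = sorted((o, s, p) for s, p, o in triples)  # os -> p lookups

    def prefix_scan(index, prefix):
        """Return all entries in a sorted index starting with the given prefix."""
        lo = bisect.bisect_left(index, prefix)
        return [e for e in index[lo:] if e[:len(prefix)] == tuple(prefix)]

    # po -> s: which subjects have region = estuary?
    print(prefix_scan(pos, ("region", "estuary")))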

53 Experimental Results: Queries
3.6M triples, 606k resources, 149 signatures (query timing results shown on slide)

54 Hands-off Operation
Feed it triples
–calculates signatures
–computes signature extents
Working on incremental facility for insertions
–resource can change signatures
–new signatures can be created
API doesn't name tables

