1
Visualization of Heterogeneous Data
Mike Cammarano, Xin (Luna) Dong, Bryan Chan, Jeff Klingner, Justin Talbot, Alon Halevy, Pat Hanrahan

Good morning, all. My name is Mike Cammarano, and I'm pleased to present this work on behalf of multiple collaborators at Stanford University as well as at Google and AT&T Labs. I'll be describing our paper, "Visualization of Heterogeneous Data." In order to introduce this concept, let's first consider the familiar scenario of visualizing homogeneous data.
2
Homogeneous data is easy.
Company     Founded   Headquarters   Logo
Microsoft   1975      47.6 N, W      (image)
Enron       1985      29.7 N, W      (image)
Google      1998      37.4 N, W      (image)

A straightforward example of homogeneous data is a relational database table. Every column has a known data type. Every row in the table has the same structure as all the other rows. That is, every row has a set of identically named and typed fields. Simple visualizations are formed by systematically mapping fields onto visual attributes. For example, consider drawing a timeline.
3
Homogeneous data is easy.
Company     Founded   Headquarters   Logo
Microsoft   1975      47.6 N, W      (image)
Enron       1985      29.7 N, W      (image)
Google      1998      37.4 N, W      (image)

Timeline: 1975  1985  1998

This involves mapping a column containing dates onto the horizontal position of marks within the visual representation.
4
Homogeneous data is easy.
Company     Founded   Headquarters   Logo
Microsoft   1975      47.6 N, W      (image)
Enron       1985      29.7 N, W      (image)
Google      1998      37.4 N, W      (image)

The situation is similar for geographic visualization. Columns of longitude and latitude are mapped onto the x and y coordinates of marks. In practice, you might need to JOIN several tables of a relational database. However, the result of that JOIN is still a table, neatly divided into consistently named and typed columns.
5
Multiple sources? Collaborative content Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
...

For many applications, we may want to integrate semi-structured data that comes from multiple independent sources. Wikipedia is a great example of collaborative content where we find a rich and interesting collection of data from multiple authors. Especially interesting is the semi-structured data found in the infoboxes, which accompany many articles. The underlying wikitext representation of the infoboxes consists of attribute/value pairs. Note that the values may include wikilinks, which are references to other Wikipedia entries. These connections among infoboxes form a graph.
6
DBpedia.org According to DBpedia.org:
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. The DBpedia dataset currently provides information about more than 1.95 million "things", including at least:
80,000 persons
70,000 places
35,000 music albums
12,000 films

Our entry point to this data is through a project called DBpedia. DBpedia extracts structured content from Wikipedia and converts it into RDF, the Resource Description Framework used in semantic web technologies. This conversion means that the wikilinks are expanded to URIs, and all the object and attribute names from Wikipedia are put within a URI namespace. The resulting DBpedia dataset can be combined with other RDF graphs, like the Geonames database of geographical information. We are not affiliated with either Wikipedia or DBpedia; we're just end users. We use the RDF graphs provided by DBpedia as our primary data source.
7
Database size
We use a subset of DBpedia, mostly infoboxes and Geonames: 30 M triples, 2.5 GB. We currently use an in-memory database. Hardware: two dual-core AMD Opteron 280 processors with 8 GB of RAM.

The subset of DBpedia we use includes the infobox and Geonames collections. Combined, this amounts to about 30 million triples, occupying about 2.5 GB. We're currently using an in-memory database.
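The system itself keeps these triples in its own in-memory store, but for readers who want to explore a similar subset, here is a minimal sketch using Python's rdflib. The filenames are placeholders for whatever N-Triples dumps of the infobox and Geonames data are at hand; this is not the store the system actually uses.

```python
from rdflib import Graph, URIRef

# Hypothetical filenames: any N-Triples dumps of the DBpedia infobox and
# Geonames subsets would do.
g = Graph()
g.parse("dbpedia_infobox_subset.nt", format="nt")
g.parse("geonames_subset.nt", format="nt")

print(f"{len(g)} triples loaded")   # on the order of 30 million for the full subset

# Inspect the attribute/value pairs stored for one subject.
poe = URIRef("http://dbpedia.org/resource/Edgar_Allan_Poe")
for _, predicate, obj in g.triples((poe, None, None)):
    print(predicate, obj)
```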
8
A glimpse inside DBpedia
Here is a tiny subgraph of DBpedia consisting of six subjects: two people and four places. Each subject is shown as a box. Within each box, several attribute/value pairs are shown. Literal objects, like strings and decimals, are shown in place. Meanwhile, references to other objects are drawn as arrows. From this diagram, the graph structure of the data should be clear.
9
A glimpse inside DBpedia
Now, suppose we wanted to obtain geocoordinates of public figures' birthplaces in order to visualize them on a map. What paths within the graph need to be explored? For Kerry, we begin by following the dbp:PLACE_OF_BIRTH association to the entry for Aurora. Then, we look up the value of the dbp:latitude attribute. For Poe, we first follow the dbp:birth_place association to dbp:Boston. Next we follow the owl:sameAs association to the Boston entry in the Geonames data. Finally we look up the geo:latitude attribute. Retrieving "the same field" (birth location) for each of these people required traversing completely different sequences of predicates, that is, different paths through the graph. The paths were different lengths. They involved different predicates, and they led to results of different data types: one is a string encoding, and one a decimal value. Let me just say that there is nothing special about Wikipedia and also that there is nothing unusually perverse about this specific example. This is meant to be representative of the heterogeneity issues that typically arise when combining data from multiple sources. We've chosen to work with Wikipedia/DBpedia because it is a large, familiar, readily understandable data set that clearly illustrates the heterogeneity characteristics we want to address.

Kerry: dbp:PLACE_OF_BIRTH → dbp:latitude → "39° 41´ 45˝ N"
Poe: dbp:birth_place → owl:sameAs → geonames:latitude
10
Heterogeneity
Types: decimal vs. sexagesimal coordinates (39.70 vs. "39° 41´ 45˝ N")
Names: PLACE_OF_BIRTH vs. birth_place
Paths: dbp:PLACE_OF_BIRTH → dbp:latitude vs. dbp:birth_place → owl:sameAs → geonames:latitude

To summarize, we've seen three different varieties of heterogeneity: mismatched data types, inconsistent field/predicate names, and finally heterogeneity of paths. I'll briefly note that many data type issues arise when string representations are used in place of numeric or date types. To address this, we preprocess the DBpedia data in an attempt to parse common string representations of dates and geographic coordinates. When such a string is found, we augment the graph by adding a new attribute with the appropriate datatype, derived from the string form. The majority of our work, however, is focused on handling heterogeneity of names and paths. The observation that motivates us is that although corresponding paths may involve different predicates, there may nonetheless be significant textual similarity between them. In this case, we see that the terms "birth" and "latitude" are present in both paths, even though none of the predicates are identical.
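To make the preprocessing step concrete, here is a small sketch, not the system's actual code, that parses one common sexagesimal coordinate format into a decimal value. The regex and function name are illustrative and cover only strings like the example above.

```python
import re

# Matches strings such as "39° 41´ 45˝ N"; minutes and seconds are optional.
SEXAGESIMAL = re.compile(
    r"""(?P<deg>\d+)\s*°\s*
        (?:(?P<min>\d+)\s*[´'′]\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*[˝"″]\s*)?
        (?P<hemi>[NSEW])""",
    re.VERBOSE,
)

def parse_coordinate(text: str):
    """Return signed decimal degrees, or None if the string doesn't parse."""
    m = SEXAGESIMAL.search(text)
    if not m:
        return None
    value = float(m.group("deg"))
    value += float(m.group("min") or 0) / 60
    value += float(m.group("sec") or 0) / 3600
    return -value if m.group("hemi") in ("S", "W") else value

print(parse_coordinate("39° 41´ 45˝ N"))   # ≈ 39.6958, i.e. the 39.70 above
```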
11
Scenario / Demo At this point, I’m going to step through screenshots from a simple example. I’ll show how we use our interface to construct a visualization by querying the database of Wikipedia entries. Wikipedia articles are organized into categories. We begin by selecting a category to visualize.
12
Scenario / Demo In this case, let’s select the category of incumbent members of the U.S. Senate. Next, we’ll choose one of several visualization templates that we wish to apply to this collection. Let’s use the geographic view.
13
Scenario / Demo Having selected the map visualization, we are presented with a template showing the fields needed to construct it. To put items on a map, our system will attempt to find decimal values for latitude and longitude, an image, and a text caption. These four fields are necessary input for the chosen visualization. However, the mapping between these visualization fields and the underlying data is extremely flexible. In fact, this interface allows us to construct arbitrary keyword queries describing what features of the data we want to map to each visual attribute. Some generic default keywords are filled in automatically.
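To illustrate what the interface is collecting, here is a hypothetical sketch of the template plus per-field keyword queries as a plain data structure. The class and field names are invented for illustration; they are not the system's internal representation.

```python
from dataclasses import dataclass

# An illustrative sketch of the map template: four fields, each with a required
# datatype and a user-editable keyword query.
@dataclass
class FieldQuery:
    name: str          # visualization field to fill (e.g. a map coordinate)
    value_type: str    # datatype the matched value must have
    keywords: str      # keyword query describing the desired data

# Generic default keywords are prefilled for each field.
map_template = [
    FieldQuery("latitude",  "decimal", "latitude"),
    FieldQuery("longitude", "decimal", "longitude"),
    FieldQuery("image",     "image",   "image"),
    FieldQuery("caption",   "string",  "name"),
]

# Refining the query, as in the senators scenario: ask specifically for the
# coordinates of each senator's state capital.
map_template[0].keywords = "state capital latitude"
map_template[1].keywords = "state capital longitude"
```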
14
Scenario / Demo Suppose we click search while using the default keywords. Our system then searches the underlying data graph in the neighborhood of each senator, trying to find paths that match the query terms we specified. The best matches are then plotted on the map. We can also switch to a tabular view of this data.
15
Scenario / Demo A limitation of the generic default query is that people are often associated with multiple places – their birthplace, their current residence, and so forth. At this point, the system is retrieving an arbitrary mix of these. Let’s suppose that for this collection of senators, we’re particularly interested in mapping the states they represent. So we’ll make the latitude and longitude queries more specific:
16
Scenario / Demo … by saying that we particularly care about state capitals, and repeating the search.
17
Scenario / Demo Again, we’ll return to a tabular view.
This shows the best match found for each field, as well as a description of where it was found in the graph. One thing we can learn from this view is that the values used didn't all come from the same place. Some of the geocoordinates were obtained from a state's capital entry in the graph, but others were retrieved from the state's largest city. As we saw from the sample graph shown earlier, we often need to follow different paths to retrieve analogous information.
18
Vision: Self-configuring data
Let's talk about the big picture for a moment. The grand motivating vision here is that a casual user should be able to gather several databases from independent sources, pour them all into one big repository, and have them automatically link up and cross-reference with each other. Self-configuring data. Furthermore, it should be easy to drag bundles of information from this unified information space onto common visual representations like maps, timelines, etc., and have it just work. We don't want to force the user to undertake an expensive data integration step upfront. We want to defer this cost, and only perform data integration on demand. The approach we've adopted is to tightly integrate the search for synonymous paths into exploratory visualization.
19
Contributions
- Visualize heterogeneous data represented as a graph of relationships between objects
- Describe the inputs to a visualization: a visualization template and a set of keywords per attribute
- Find the attributes needed for a visualization by searching paths, within an iterative process of search, visualization, and refinement
- Present an algorithm for finding and ranking paths based on keywords: efficiently enumerate paths (A*, random sampling) and rank according to keywords and heuristics about graph structure

We make the following specific contributions: we pose the visualization of heterogeneous data as a search problem; we describe a method for specifying the inputs to a visualization as keyword queries; and we describe an algorithm that searches the space of possible paths and ranks them according to the visualization queries.
20
Integrate searching and visualization
Search for potentially desirable paths → Visualize results → Refine path selections in context

As we've seen, the user specifies the kinds of paths desired in terms of textual keywords and fairly generic data types. Our algorithm searches the graph for corresponding path instances, and ranks them according to textual similarity as well as several heuristics about graph structure. Because this search is imprecise, the user will sometimes need to manually refine the rankings of paths in difficult cases. This iterated process of search, refinement, and discovery is how the user makes sense of a large dataset.
21
Matching problem Find the best path to a number for “state latitude”
[Graph diagram: a simplified neighborhood of Dianne Feinstein, with edges such as state → capital → latitude (42.4), state → pop, state → governor → children (4), spouse → birth place → latitude (39.0), party (a node with a name, house leader Harry Reid, and color blue), and name.]

Let's walk through an example of searching for paths in the graph. I'll go over our algorithm using an example for one senator and one attribute. From all the possible paths starting from Dianne Feinstein, we want the one that leads to a numerical value and best matches the keywords "state latitude". This is then repeated for each senator in the visualization. Note that the graph here is a simplified subset. In the real Wikipedia data, nodes tend to have at least a dozen edges, most of which do not lead to a state latitude. So imagine for each inappropriate edge here that there is actually a branch of many such bad edges in the real graph.
22
Basic algorithm: find the best path to a number for "state latitude"
1. Explore the graph
2. Find paths ending in a number: state.capital.latitude, state.pop, spouse.birth_place.latitude, state.governor.children
3. Score and rank paths using TF/IDF (example scores on the slide: 0.8, 0.5, 0.6)

[Graph diagram repeated from the previous slide.]

In the basic algorithm, we explore the graph up to a certain depth (3 in this case), keep the paths that end in the type we want (a number), and give each path a score based on how closely the terms in the path match the keywords, weighted by TF/IDF. In this case we get state.capital.latitude as the highest-scoring path.
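Here is a rough Python sketch of those three steps under an assumed data model, where the graph is a dictionary mapping each object to a list of (predicate, target) edges and a target is either another object or a literal value. The function names and the simplified TF/IDF weighting are illustrative, not the system's actual implementation.

```python
def enumerate_paths(graph, start, max_depth=3):
    """Steps 1-2: yield (predicate path, literal value) for paths ending in a literal."""
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        for predicate, target in graph.get(node, []):
            new_path = path + [predicate]
            if target in graph:                     # interior object: keep exploring
                if len(new_path) < max_depth:
                    stack.append((target, new_path))
            else:                                   # literal: candidate endpoint
                yield new_path, target

def score(path, keywords, idf):
    """Step 3: reward paths whose predicate terms overlap the query keywords."""
    terms = {t for predicate in path for t in predicate.lower().split("_")}
    matched = sum(idf.get(k, 1.0) for k in keywords if k in terms)
    return matched / len(path)                      # mild preference for short paths

def best_numeric_path(graph, start, keywords, idf):
    candidates = [(p, v) for p, v in enumerate_paths(graph, start)
                  if isinstance(v, (int, float))]   # keep only numeric endpoints
    return max(candidates, key=lambda pv: score(pv[0], keywords, idf), default=None)

# e.g. best_numeric_path(graph, "Dianne Feinstein", ["state", "latitude"], idf={})
```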
23
Improving execution time
New pruning techniques since the paper submission:
- A*
- Bidirectional search on terms
- Random sampling

This naïve algorithm suffers from the combinatorial explosion in the number of paths. Our original performance numbers reported in the paper were quite slow. I'll take this opportunity to describe a few optimizations we've subsequently tried.
24
Pruning techniques

Most paths do not correspond to a "state latitude". How can we avoid such bad paths?

[Graph diagram repeated, with callouts marking branches that have no mention of latitude, branches with many unrelated terms, and branches with no potential paths.]

Exhaustive enumeration of paths considered many that didn't match the query terms at all. So, we can apply an A* search with a scoring function that penalizes non-matching terms.
25
Pruning techniques / A* Search
Use a scoring function that penalizes unrelated terms. Then an A* search ignores paths with many such terms.

[Graph diagram repeated; branches whose first edges consist of many unrelated terms are left unexpanded.]

Once some reasonable paths are found, the A* search can ignore paths that start with too many unrelated terms.
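Here is a sketch of how that pruning might look, reusing the `score` function and adjacency-list graph from the earlier sketch. This is a heuristic best-first search meant only to illustrate the idea of skipping prefixes full of unrelated terms, not a faithful reproduction of the system's A* formulation.

```python
import heapq

def astar_numeric_path(graph, start, keywords, idf, max_depth=3):
    """Best-first search: expand promising prefixes first, skip poorly scoring ones."""
    best = None                                    # (score, path, literal value)
    frontier = [(0.0, [], start)]                  # max-heap via negated scores
    while frontier:
        neg_score, path, node = heapq.heappop(frontier)
        if best is not None and -neg_score <= best[0]:
            continue                               # prefix already scores worse: prune
        for predicate, target in graph.get(node, []):
            new_path = path + [predicate]
            s = score(new_path, keywords, idf)     # keyword match of the prefix so far
            if target in graph:                    # interior object: may expand later
                if len(new_path) < max_depth:
                    heapq.heappush(frontier, (-s, new_path, target))
            elif isinstance(target, (int, float)): # numeric literal: candidate answer
                if best is None or s > best[0]:
                    best = (s, new_path, target)
    return best
```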
26
A* pruning results (senators-on-a-map example)

Average # of edges examined at each depth (1–4), full enumeration:
  Image: 66, 5409, 134226
  Name: 5446, 168673
  Latitude: 5408, 145549

Average # of edges examined at each depth (1–4), using A*:
  Image: 66, 2049, 1615, 198
  Name: 9, 5092, 228
  Latitude: 598, 2272, 2148

When searching longer paths, of length 3 or 4, this substantially reduces the number of paths to consider. Here are some statistics about the effectiveness of A* on the senators-on-a-map example. The top table shows the number of edges examined at each depth with the naïve algorithm; the bottom table shows the number examined when we prune with A*.
27
Pruning techniques / Random Sampling
Do a normal A* search for n randomly chosen nodes.

[Graph diagram repeated, rooted at Dianne Feinstein: the state → capital → latitude path is marked "A hit!", while unpromising branches are marked "No potential paths".]

Another optimization we've added is to apply the full search only to a sparse sampling of the seed nodes. Remember, we're performing our search for each member of a collection, each of 100 senators, say. Since there can be a lot of similarity in the local graph structure around each one, we end up repeatedly exploring a lot of similar graphs. Instead, we'll randomly select a subset of the senators, run A* on those as before, and keep track of the successful paths. Now, switch to a different senator. (Draw attention to the switch from Feinstein to Kerry; their local graphs will probably be slightly different.) The intuition is that the needed attribute can usually be found along a small handful of fairly common paths. Choosing n large enough to hit most of these paths means all the other nodes can run a much shorter search.
28
Pruning techniques / Random Sampling
Do a normal A* search for n randomly chosen nodes. Only search the known hits for the remaining nodes. This prevents repeatedly checking where there are likely no paths.

[Graph diagram now rooted at John Kerry: the state → capital → latitude path is again marked "A hit!", and branches with no potential paths are skipped.]

Rather than repeating the full A* search, we only test for the existence of paths already found to be high-scoring. If we find one, we just reuse it.
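A sketch of the sampling strategy, building on the `astar_numeric_path` and `score` sketches above. `follow_path` is an assumed helper, and the handling of duplicate predicates is simplified; the names are illustrative only.

```python
import random

def follow_path(graph, node, path):
    """Check whether a specific predicate sequence exists from `node`; return its literal."""
    for predicate in path:
        edges = dict(graph.get(node, []))           # collapses duplicate predicates
        if predicate not in edges:
            return None
        node = edges[predicate]
    return None if node in graph else node          # must end at a literal

def search_collection(graph, nodes, keywords, idf, n_seeds=10):
    seeds = random.sample(nodes, min(n_seeds, len(nodes)))
    known_paths, results = [], {}

    # Full A*-style search only on the sampled seed nodes.
    for node in seeds:
        hit = astar_numeric_path(graph, node, keywords, idf)
        if hit:
            results[node] = hit
            if hit[1] not in known_paths:
                known_paths.append(hit[1])          # remember the successful path

    # For everyone else, just try the already-discovered paths.
    for node in nodes:
        if node in results:
            continue
        for path in known_paths:
            value = follow_path(graph, node, path)
            if value is not None:
                results[node] = (score(path, keywords, idf), path, value)
                break
    return results
```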
29
Sampling results

Average # of edges examined at all depths:

Field       Seed nodes (10)   Others (89)
Image       920               82
Name        40                35
State       200               175
Latitude    3100              144
Longitude
TOTAL       7360              580

Total edges examined: without sampling, 7360 × 99 = 728,640; with sampling, 7360 × 10 + 580 × 89 = 125,220.

Again, looking at statistics from the senators example, this reduces the number of edges examined.
30
Performance

Performance is now interactive: with the new pruning techniques, about 100x faster than reported in the paper.

Runtime for the senators example: State latitude 0.911, State longitude 0.854, Image 0.542, Name 0.513, Instances 0.187; total 3.007 sec
Runtime for the astronauts example: Mission launch 1.109, Mission insignia 1.151, Name 0.743, Instances 1.102; total 4.105 sec
Runtime for each field in the countries example: GDP per capita 1.142, Inflation 2.228, Flag 0.867, Name 1.108, Instances 1.136; total 6.481 sec

Here are some timings with these pruning techniques enabled. Searches now take seconds instead of minutes.
31
Variations – senators’ flags versus birth places
(Visualization doesn't impose semantics; the mapping is flexible using keywords.)
Exploration! Incremental change of a visualization shows you something new.
Answering analytical questions: birth state vs. state represented.
32
Timeline of manned spaceflight
33
Scatterplot of inflation vs. GDP
… Come see us after the talk if you want to see the system live.
34
Precision / Recall

Senators, image:
  Accepted: 86 correct, 6 incorrect

Senators, state latitude:
  Accepted: 64 correct, 34 incorrect
  Rejected: 1

Countries, GDP per capita:
  Accepted: 206 correct, 58 incorrect
  Rejected: 9

(A 2×2 accepted/rejected vs. correct/incorrect table for each of 3–4 fields, in each of two examples.)
35
Summary
- Visualize heterogeneous data represented as a graph of relationships between objects
- Produce visualizations conforming to templates by searching for the needed attributes
- Present an algorithm for finding and ranking paths based on keywords: efficiently enumerate paths and rank them
- Now fast enough for interactive use
- High precision and recall

So, once again: we pose the visualization of heterogeneous graphs of data as a search problem; we describe a method for specifying the inputs to a visualization as keyword queries; and we describe an algorithm that searches the space of possible paths and ranks them according to the visualization queries.
36
Future work

Improvements:
- UI support for initial discovery and query refinement
- Robustness of terms / improved ranking
- Automatic selection of visualization
- Visualizing missing data
- Visualizations that reflect result relevance (selective emphasis)

Deploy on the web:
- Wikipedia
- The whole web

There are a lot of neat things to do with this in the future! (End with the vision for the future; clearly reinforce it: think big!)
37
Acknowledgements Funding sources: Boeing RVAC CALO Tools and data: DBpedia MIT SIMILE project timeline Tom Patterson’s map artwork
38
The end!
39
Pruning techniques / Bidirectional Search
Before the A* search, search one step back from each literal, following only edges that match keywords. This saves one step during the forward A* search.

[Graph diagram repeated; the literal latitude values are found by stepping backwards along latitude edges, while branches with no mention of latitude are ignored.]

During the forward A* search, this saves one extra step at the end.
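A sketch of the backward step under the same assumed adjacency-list model: index, for each object, the keyword-matching edges that lead directly to a literal, so the forward search can stop one level earlier and finish with a lookup. The names here are illustrative.

```python
def build_backward_index(graph, keywords):
    """Map each object to its outgoing keyword-matching edges that end in a literal."""
    index = {}
    for node, edges in graph.items():
        for predicate, target in edges:
            if target in graph:
                continue                            # only edges ending in literals
            terms = predicate.lower().split("_")
            if any(k in terms for k in keywords):
                index.setdefault(node, []).append((predicate, target))
    return index

# Usage idea: run the forward search only to depth max_depth - 1, then complete
# each surviving prefix via index.get(frontier_node, []) instead of expanding
# the final layer of edges.
```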
40
Need for multiple paths
Results from live system How many different paths need to be followed?
41
Need for multiple paths
Results from live system How many different paths need to be followed?