Visualization of Heterogeneous Data
Mike Cammarano, Xin (Luna) Dong, Bryan Chan, Jeff Klingner, Justin Talbot, Alon Halevy, Pat Hanrahan
Homogeneous data is easy.
Table columns: Company, Founded, Headquarters, Logo. Rows: Microsoft (N, W), Enron (N, 95.3 W), Google (N, W).
Multiple sources? Collaborative content. Semi-structured data.
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]...
DBpedia.org
According to DBpedia.org: DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. The DBpedia dataset currently provides information about more than 1.95 million “things”, including at least 80,000 persons, 70,000 places, 35,000 music albums, and 12,000 films.
Database size
We use a subset of DBpedia, mostly infoboxes and geonames: 30 M triples, 2.5 GB. We currently use an in-memory database. The hardware is a dual-processor machine with dual-core AMD Opteron 280s and 8 GB RAM.
A glimpse inside DBpedia
Kerry: dbp: PLACE_OF_BIRTH -> dbp: latitude -> 39° 41´ 45˝ N
Poe: dbp: birth_place -> w3c: owl#sameAs -> geonames: latitude
Heterogeneity
Types: decimal vs. sexagesimal coordinates (39° 41´ 45˝ N vs. 39.70)
Names: PLACE_OF_BIRTH vs. birth_place
Paths: dbp: PLACE_OF_BIRTH -> dbp: latitude vs. dbp: birth_place -> w3c: owl#sameAs -> geonames: latitude
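The type mismatch between sexagesimal and decimal coordinates can be bridged by a small normalization step. The following is a minimal sketch, not the authors' code; the function name and the accepted input format (degrees, minutes, seconds, hemisphere letter) are assumptions chosen for illustration.

```python
import re

# Hypothetical helper: degrees, minutes, seconds, then a hemisphere letter.
_DMS = re.compile(r"\s*(\d+)\D+(\d+)\D+(\d+(?:\.\d+)?)\D*([NSEW])\s*$")

def sexagesimal_to_decimal(text):
    """Convert a string such as '39° 41´ 45˝ N' to signed decimal degrees."""
    match = _DMS.match(text)
    if match is None:
        raise ValueError(f"not a sexagesimal coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative by the usual convention.
    return -value if hemisphere in "SW" else value

print(round(sexagesimal_to_decimal("39° 41´ 45˝ N"), 4))
```

For the slide's example the result is about 39.6958 degrees, which rounds to the 39.70 shown as the decimal form.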
Scenario / Demo
Vision: Self-configuring data
Contributions
Visualize heterogeneous data represented as a graph of relationships between objects.
Describe inputs to a visualization: a visualization template and a set of keywords per attribute.
Find attributes needed for a visualization by searching paths, within an iterative process of search, visualization, and refinement.
Present an algorithm for finding and ranking paths based on keywords: efficiently enumerate paths (A*, random sampling) and rank them according to keywords and heuristics about graph structure.
Integrate searching and visualization
Search for potentially desirable paths. Visualize results. Refine path selections in context.
Matching problem
Find the best path to a number for “state latitude”.
[Figure: example graph around Dianne Feinstein, with edges labeled state, capital, latitude, pop, spouse, birth place, party, house leader, name, color, governor, and children; values shown include 42.4, 39.0, 4, blue, and Harry Reid.]
Basic algorithm
1. Explore graph
2. Find paths ending in a number
3. Score and rank paths using TF/IDF
Find the best path to a number for “state latitude”. Candidate paths: state.capital.latitude, state.pop, spouse.birth_place.latitude, state.governor.children.
[Figure: example graph around Dianne Feinstein.]
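The three steps of the basic algorithm can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the slides do not give the exact TF/IDF weighting, so the code uses a common variant (term frequency within a path times log inverse path frequency). The toy graph, its node ids (n_state, n_capital, ...), and the pop value are hypothetical; edge labels and the other numbers follow the example figure.

```python
import math

# Hypothetical toy graph: node -> list of (edge label, target).
GRAPH = {
    "Dianne Feinstein": [("state", "n_state"), ("spouse", "n_spouse"), ("party", "n_party")],
    "n_state": [("capital", "n_capital"), ("pop", 36.5), ("governor", "n_governor")],
    "n_capital": [("latitude", 42.4)],
    "n_spouse": [("birth place", "n_birthplace")],
    "n_birthplace": [("latitude", 39.0)],
    "n_governor": [("children", 4)],
    "n_party": [("color", "blue"), ("house leader", "Harry Reid")],
}

def numeric_paths(graph, start, max_depth=4):
    """Steps 1 and 2: explore the graph and yield (edge labels, value) for paths ending in a number."""
    stack = [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if isinstance(target, (int, float)):
                yield path, target
            elif len(path) < max_depth:
                stack.append((target, path))

def rank_paths(paths, keywords):
    """Step 3: score each path against the keywords with a TF/IDF-style weight."""
    docs = [[word for label in labels for word in label.split()] for labels, _ in paths]

    def idf(term):
        df = sum(term in doc for doc in docs)
        return math.log(len(docs) / df) if df else 0.0

    scored = []
    for (labels, value), doc in zip(paths, docs):
        score = sum(doc.count(k) / len(doc) * idf(k) for k in keywords)
        scored.append((score, labels, value))
    return sorted(scored, key=lambda item: item[0], reverse=True)

paths = list(numeric_paths(GRAPH, "Dianne Feinstein"))
for score, labels, value in rank_paths(paths, ["state", "latitude"]):
    print(f"{score:.3f}  {'.'.join(labels)} -> {value}")
```

On this toy graph the top-ranked path is state.capital.latitude, because it is the only path that matches both keywords.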
Improving execution time
New pruning techniques since the paper submission: A*, bidirectional search on terms, random sampling.
Pruning techniques
Most paths do not correspond to a “state latitude”. How can we avoid such bad paths?
[Figure: example graph around Dianne Feinstein, with bad paths annotated “No mention of latitude”, “Many unrelated terms”, and “No potential paths”.]
Pruning techniques / A* Search
Use a scoring function that penalizes unrelated terms. Then an A* search ignores paths with many such terms.
[Figure: example graph around Dianne Feinstein, with one path annotated “Many unrelated terms”.]
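A minimal sketch of penalty-driven pruning follows. The slides do not give the actual scoring function or heuristic, so this sketch makes several assumptions: the penalty is the count of edge-label words that are not keywords, paths are expanded lowest penalty first (a best-first search, which is A* with a trivial zero heuristic), and any path whose penalty exceeds a hypothetical max_penalty budget is dropped. The toy graph and its node ids are also hypothetical.

```python
import heapq
from itertools import count

# Hypothetical toy graph: node -> list of (edge label, target).
GRAPH = {
    "Dianne Feinstein": [("state", "n_state"), ("spouse", "n_spouse"), ("party", "n_party")],
    "n_state": [("capital", "n_capital"), ("pop", 36.5), ("governor", "n_governor")],
    "n_capital": [("latitude", 42.4)],
    "n_spouse": [("birth place", "n_birthplace")],
    "n_birthplace": [("latitude", 39.0)],
    "n_governor": [("children", 4)],
    "n_party": [("color", "blue"), ("house leader", "Harry Reid")],
}

def best_first_numeric_paths(graph, start, keywords, max_penalty=1, max_depth=4):
    """Return (penalty, edge labels, value) for numeric paths within the penalty budget."""
    keywords = set(keywords)
    tie = count()  # tie-breaker so the heap never has to compare nodes or paths
    frontier = [(0, next(tie), start, ())]
    hits = []
    while frontier:
        penalty, _, node, labels = heapq.heappop(frontier)
        for label, target in graph.get(node, []):
            unrelated = sum(word not in keywords for word in label.split())
            new_penalty = penalty + unrelated
            if new_penalty > max_penalty:
                continue  # too many unrelated terms: prune this path
            path = labels + (label,)
            if isinstance(target, (int, float)):
                hits.append((new_penalty, path, target))
            elif len(path) < max_depth:
                heapq.heappush(frontier, (new_penalty, next(tie), target, path))

    def matched(path):
        words = {word for label in path for word in label.split()}
        return len(words & keywords)

    # Lowest penalty first; among equals, prefer paths matching more keywords.
    return sorted(hits, key=lambda hit: (hit[0], -matched(hit[1])))

for penalty, labels, value in best_first_numeric_paths(GRAPH, "Dianne Feinstein", ["state", "latitude"]):
    print(penalty, ".".join(labels), value)
```

With a budget of one unrelated term, the spouse and governor branches are abandoned before they reach a number, while state.capital.latitude survives and ranks first.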
A* pruning results (senators on map)
[Table: average # of edges examined at each depth (1, 2, 3, 4) for Image, Name, and latitude, under full enumeration and using A*.]
Pruning techniques / Random Sampling
Do normal A* search for n randomly chosen nodes.
[Figure: example graph around Dianne Feinstein, annotated “No potential paths” and “A hit!”.]
Pruning techniques / Random Sampling
Do normal A* search for n randomly chosen nodes. Only search known hits for the remaining nodes. This prevents repeatedly checking where there are likely no paths.
[Figure: example graph around John Kerry, annotated “No potential paths” and “A hit!”.]
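The sampling idea can be sketched as a two-phase procedure: search fully from a few seed nodes, remember which label paths produced hits, then only replay those paths from every other node. This is a sketch under assumptions, not the authors' code: plain enumeration stands in for the A* search, the function names are hypothetical, and the toy graph uses approximate capital latitudes for illustration.

```python
import random

# Hypothetical toy graph: node -> list of (edge label, target).
GRAPH = {
    "Dianne Feinstein": [("state", "California"), ("party", "Democratic")],
    "Harry Reid": [("state", "Nevada"), ("party", "Democratic")],
    "John Kerry": [("state", "Massachusetts"), ("party", "Democratic")],
    "California": [("capital", "Sacramento")],
    "Nevada": [("capital", "Carson City")],
    "Massachusetts": [("capital", "Boston")],
    "Sacramento": [("latitude", 38.58)],
    "Carson City": [("latitude", 39.16)],
    "Boston": [("latitude", 42.36)],
}

def numeric_paths(graph, start, max_depth=4):
    """Full search from one node (stand-in for A*): yield (edge labels, value)."""
    stack = [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if isinstance(target, (int, float)):
                yield path, target
            elif len(path) < max_depth:
                stack.append((target, path))

def follow(graph, node, labels):
    """Return every target reached from node by following the edge labels in order."""
    frontier = [node]
    for wanted in labels:
        frontier = [t for n in frontier for label, t in graph.get(n, []) if label == wanted]
    return frontier

def sampled_search(graph, nodes, n_seeds, rng):
    """Search fully from n_seeds random nodes; replay only the known hits elsewhere."""
    seeds = set(rng.sample(nodes, n_seeds))
    known, results = set(), {}
    for node in seeds:
        found = list(numeric_paths(graph, node))
        known.update(labels for labels, _ in found)
        results[node] = dict(found)
    for node in nodes:
        if node in seeds:
            continue
        results[node] = {}
        for labels in known:
            numbers = [v for v in follow(graph, node, labels) if isinstance(v, (int, float))]
            if numbers:
                results[node][labels] = numbers[0]
    return results

senators = ["Dianne Feinstein", "Harry Reid", "John Kerry"]
print(sampled_search(GRAPH, senators, 1, random.Random(0)))
```

The trade-off is that a path which exists only for non-seed nodes is never discovered, so the number of seeds controls how much recall is risked for the speedup.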
Sampling results
Average # edges examined at all depths, for seed nodes (10) vs. others (89): Image 92082; Name 4035; State Latitude Longitude; TOTAL.
Total edges examined: without sampling 7360×99= ; with sampling 7360× ×89=
Performance
Runtime for senators’ example (table labels: State latitude, State longitude, Image, Name, Instances, total sec).
Runtime for astronauts’ example (table labels: Mission launch, Mission insignia, Name, Instances, total sec).
Runtime for each field in countries’ example (table labels: GDP per capita, Inflation, Flag, Name, Instances, total sec).
Performance is now interactive. With the new pruning techniques, it is ~100x faster than reported in the paper.
Variations – senators’ flags versus birth places
Timeline of manned spaceflight
Scatterplot of inflation vs. GDP
Precision / Recall
Tables of Correct / Incorrect (columns) by Accepted / Rejected (rows):
Senators – state latitude: 6434 Accepted, 10 Rejected
Countries – gdp per capita: 20658 Accepted, 90 Rejected
Senators – image: 866 Accepted, 06 Rejected
Summary
Visualize heterogeneous data represented as a graph of relationships between objects.
Produce visualizations conforming to templates by searching for needed attributes.
Present an algorithm for finding and ranking paths based on keywords: efficiently enumerate paths, then rank them.
Now fast enough for interactive use.
High precision and recall.
Future work
Improvements: UI support for initial discovery and query refinement; robustness of terms / improved ranking; automatic selection of visualization; visualizing missing data; visualizations that reflect result relevance (selective emphasis).
Deploy on the web: Wikipedia; the whole web.
Acknowledgements
Funding sources: Boeing, RVAC, CALO.
Tools and data: DBpedia, MIT SIMILE project timeline, Tom Patterson’s map artwork.
The end!
Pruning techniques / Bidirectional Search
Before A*, search one step back from each literal, following only edges that match keywords. This saves one step during forward A* search.
[Figure: example graph around Dianne Feinstein, with one path annotated “No mention of latitude”.]
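The backward step can be sketched as a precomputation: collect every node that has a keyword-matching edge into a numeric literal, then let the forward search stop as soon as it reaches one of those nodes. This is an illustrative sketch, not the authors' implementation; plain depth-first enumeration stands in for the forward A* search, and the function names, node ids, and pop value are hypothetical.

```python
# Hypothetical toy graph: node -> list of (edge label, target).
GRAPH = {
    "Dianne Feinstein": [("state", "n_state"), ("spouse", "n_spouse"), ("party", "n_party")],
    "n_state": [("capital", "n_capital"), ("pop", 36.5), ("governor", "n_governor")],
    "n_capital": [("latitude", 42.4)],
    "n_spouse": [("birth place", "n_birthplace")],
    "n_birthplace": [("latitude", 39.0)],
    "n_governor": [("children", 4)],
    "n_party": [("color", "blue"), ("house leader", "Harry Reid")],
}

def backward_targets(graph, keywords):
    """One step back from each numeric literal, keeping only keyword-matching edges."""
    keywords = set(keywords)
    targets = {}
    for node, edges in graph.items():
        for label, target in edges:
            if isinstance(target, (int, float)) and keywords & set(label.split()):
                targets.setdefault(node, []).append((label, target))
    return targets

def forward_to_targets(graph, start, targets, max_depth=3):
    """Forward search (stand-in for A*) that finishes at any precomputed target node."""
    stack = [(start, ())]
    found = []
    while stack:
        node, labels = stack.pop()
        # The last edge comes from the backward step, so it is not searched again.
        for label, literal in targets.get(node, []):
            found.append((labels + (label,), literal))
        if len(labels) < max_depth:
            for label, nxt in graph.get(node, []):
                if not isinstance(nxt, (int, float)):
                    stack.append((nxt, labels + (label,)))
    return found

targets = backward_targets(GRAPH, ["latitude"])
for labels, value in sorted(forward_to_targets(GRAPH, "Dianne Feinstein", targets)):
    print(".".join(labels), value)
```

Literals reached by edges with no mention of latitude, such as pop and children, never become targets, so the forward search does not report them.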
Need for multiple paths