Published by Trevor Hawkins. Modified over 9 years ago.
1
Visualization of Heterogeneous Data
Mike Cammarano, Xin (Luna) Dong, Bryan Chan, Jeff Klingner, Justin Talbot, Alon Halevy, Pat Hanrahan
2
Homogeneous data is easy.
Company    Founded  Headquarters     Logo
Microsoft  1975     47.6 N, 122.1 W
Enron      1985     29.7 N, 95.3 W
Google     1998     37.4 N, 122.0 W
3
Homogeneous data is easy.
[Same table as on the previous slide, with Founded plotted on a timeline from 1970 to 2000: 1975, 1985, 1998]
4
Homogeneous data is easy.
[Same table as on the previous slide, shown with the 1970 to 2000 timeline axis]
5
Multiple sources?
Collaborative content
Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848...
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]...
6
DBpedia.org
According to DBpedia.org:
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. The DBpedia dataset currently provides information about more than 1.95 million “things”, including at least:
  80,000 persons
  70,000 places
  35,000 music albums
  12,000 films
7
Database size
We use a subset of DBpedia, mostly infoboxes and geonames.
  30 M triples
  2.5 GB
We currently use an in-memory database.
Hardware: dual-processor, dual-core AMD Opteron 280s with 8 GB RAM.
8
A glimpse inside DBpedia
9
Kerry: dbp:PLACE_OF_BIRTH -> dbp:latitude -> 39° 41´ 45˝ N
Poe: dbp:birth_place -> w3c:owl#sameAs -> geonames:latitude -> 42.358403
10
Heterogeneity
Types: decimal vs. sexagesimal coordinates (39° 41´ 45˝ N vs. 39.70)
Names: PLACE_OF_BIRTH vs. birth_place
Paths: dbp:PLACE_OF_BIRTH -> dbp:latitude vs. dbp:birth_place -> w3c:owl#sameAs -> geonames:latitude
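The type mismatch on this slide can be made concrete with a small conversion. The following is only a sketch: the function name and the assumption that the input always holds degrees, minutes, and seconds followed by a hemisphere letter are illustrative and not taken from the talk.

```python
import re

def sexagesimal_to_decimal(text):
    """Convert a string like '39° 41´ 45˝ N' to signed decimal degrees.

    Assumes exactly degrees, minutes, seconds and a trailing N/S/E/W letter.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    degrees, minutes, seconds = (float(n) for n in numbers[:3])
    value = degrees + minutes / 60 + seconds / 3600
    hemisphere = text.strip()[-1].upper()
    # South and west are negative in the usual decimal convention.
    return -value if hemisphere in ("S", "W") else value

print(round(sexagesimal_to_decimal("39° 41´ 45˝ N"), 2))  # 39.7
```

The result agrees with the 39.70 shown on the slide: 39 + 41/60 + 45/3600 is about 39.696.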
11
Scenario / Demo
18
Vision: Self-configuring data
19
Contributions
Visualize heterogeneous data represented as a graph of relationships between objects
Describe inputs to a visualization:
  Visualization template
  Set of keywords per attribute
Find attributes needed for a visualization by searching paths
  Within an iterative process of search, visualization, and refinement
Present algorithm for finding and ranking paths based on keywords
  Efficiently enumerate paths
    A*
    Random sampling
  Rank according to:
    Keywords
    Heuristics about graph structure
20
Integrate searching and visualization
  Search for potentially desirable paths
  Refine path selections
  Visualize results in context
21
Matching problem
Find the best path to a number for “state latitude”
[Figure: example graph around the node Dianne Feinstein. Edge labels: state, capital, latitude, pop, birth place, spouse, party, house leader, name, color, governor, children. Values: 42.4, 6349000, 39.0, blue, Harry Reid, 4.]
22
Basic algorithm
Find the best path to a number for “state latitude”
  1. Explore graph
  2. Find paths ending in a number
  3. Score and rank paths using TF/IDF
Candidate paths: state.capital.latitude, state.pop, spouse.birth_place.latitude, state.governor.children
Scores shown: 0.8, 0.5, 0.6, 0.5
[Figure: example graph around the node Dianne Feinstein, with the candidate paths highlighted]
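The three steps can be sketched on a toy graph. Everything here is an assumption made for illustration: the node names, the wiring of the graph, and the scoring formula (an IDF-weighted share of matching terms). The slide only says "TF/IDF", so this is not the authors' formula and it does not reproduce the scores on the slide.

```python
import math
from collections import Counter

# Hypothetical toy graph: node -> list of (edge label, target).
# Numeric targets are literals; string targets are other nodes.
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node")],
    "state_node": [("capital", "capital_node"), ("pop", 6349000),
                   ("governor", "governor_node")],
    "capital_node": [("latitude", 42.4)],
    "spouse_node": [("birth_place", "place_node")],
    "place_node": [("latitude", 39.0)],
    "governor_node": [("children", 4)],
}

def numeric_paths(graph, start, max_depth=3):
    """Steps 1 and 2: explore the graph and keep label paths ending in a number."""
    found = []

    def walk(node, labels):
        for label, target in graph.get(node, []):
            path = labels + [label]
            if isinstance(target, (int, float)):
                found.append(tuple(path))
            elif len(path) < max_depth:
                walk(target, path)

    walk(start, [])
    return found

def rank(paths, keywords):
    """Step 3: score each path by the IDF-weighted share of its terms that match."""
    terms = {p: [t for label in p for t in label.split("_")] for p in paths}
    df = Counter(t for ts in terms.values() for t in set(ts))
    idf = {t: math.log(1 + len(paths) / n) for t, n in df.items()}
    scored = []
    for p, ts in terms.items():
        matched = sum(idf[t] for t in ts if t in keywords)
        scored.append((matched / sum(idf[t] for t in ts), p))
    return sorted(scored, reverse=True)

paths = numeric_paths(GRAPH, "senator")
for score, path in rank(paths, {"state", "latitude"}):
    print(f"{score:.2f}  {'.'.join(path)}")
```

With the keywords "state" and "latitude", state.capital.latitude ranks first in this toy setup because it matches both keywords and contains only one unrelated term.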
23
Improving execution time
New pruning techniques since the paper submission:
  A*
  Bidirectional search on terms
  Random sampling
24
Pruning techniques
Most paths do not correspond to a “state latitude”.
How can we avoid such bad paths?
[Figure: example graph around the node Dianne Feinstein, with bad paths annotated: "No mention of latitude", "Many unrelated terms", "No potential paths"]
25
Pruning techniques / A* Search
Use a scoring function that penalizes unrelated terms.
Then an A* search ignores paths with many such terms.
[Figure: example graph around the node Dianne Feinstein, with one path annotated "Many unrelated terms"]
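One way to picture this pruning is a best-first search in which every edge label that matches no keyword adds one unit of penalty, and paths above a penalty budget are dropped. This sketch uses a zero heuristic, so it is really uniform-cost search, the simplest special case of A*. The graph, the penalty rule, and the `max_penalty` parameter are all assumptions for illustration, not the scoring function from the talk.

```python
import heapq

# Hypothetical toy graph: node -> list of (edge label, target).
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node")],
    "state_node": [("capital", "capital_node"), ("pop", 6349000),
                   ("governor", "governor_node")],
    "capital_node": [("latitude", 42.4)],
    "spouse_node": [("birth_place", "place_node")],
    "place_node": [("latitude", 39.0)],
    "governor_node": [("children", 4)],
}

def best_first_paths(graph, start, keywords, max_penalty=1, max_depth=4):
    """Expand paths in order of penalty (one per unrelated edge label)."""
    frontier = [(0, 0, start, ())]  # (penalty, tie-breaker, node, labels)
    counter = 1
    hits = []
    while frontier:
        penalty, _, node, labels = heapq.heappop(frontier)
        for label, target in graph.get(node, []):
            related = bool(set(label.split("_")) & keywords)
            cost = penalty + (0 if related else 1)
            if cost > max_penalty:
                continue  # pruned: too many unrelated terms
            path = labels + (label,)
            if isinstance(target, (int, float)):
                hits.append((cost, path))
            elif len(path) < max_depth:
                heapq.heappush(frontier, (cost, counter, target, path))
                counter += 1
    return sorted(hits)

for cost, path in best_first_paths(GRAPH, "senator", {"state", "latitude"}):
    print(cost, ".".join(path))
```

With a budget of one unrelated term, spouse.birth_place.latitude and state.governor.children are never completed, because each needs two unrelated labels.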
26
A* pruning results
Senators on map
Average # of edges examined at each depth, full enumeration:
Average # of edges examined at each depth, using A*:
Depth: 1 2 3 4
  Image 6620491615198
  Name 6695092228
  latitude 6659822722148
Depth: 1 2 3 4
  Image 6654091342261393766
  Name 6654461686735245035
  latitude 6654081455491009247
27
Pruning techniques / Random Sampling
Do normal A* search for n randomly chosen nodes
[Figure: example graph around the node Dianne Feinstein, annotated "No potential paths" and "A hit!"]
28
Pruning techniques / Random Sampling
Do normal A* search for n randomly chosen nodes
Only search known hits for the remaining nodes
Prevents repeatedly checking where there are likely no paths
[Figure: example graph, now with the node John Kerry, annotated "No potential paths" and "A hit!"]
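The sampling idea can be sketched as follows: run the expensive search on a few seed nodes, remember which label paths produced hits, and for every other node only follow those known paths. The graph, the function names, and the stand-in `full_search` (a plain bounded enumeration rather than A*) are assumptions for illustration.

```python
import random

# Hypothetical data: five senators with the same path shape.
GRAPH = {}
for i in range(5):
    GRAPH[f"senator{i}"] = [("state", f"state{i}"), ("party", f"party{i}")]
    GRAPH[f"state{i}"] = [("latitude", 30.0 + i)]

def full_search(graph, node, max_depth=3):
    """Stand-in for the expensive search: label paths from node ending in a number."""
    results, stack = [], [(node, ())]
    while stack:
        current, labels = stack.pop()
        for label, target in graph.get(current, []):
            path = labels + (label,)
            if isinstance(target, (int, float)):
                results.append(path)
            elif len(path) < max_depth:
                stack.append((target, path))
    return results

def follow(graph, node, path):
    """Follow one known label path from node; return the value reached or None."""
    current = node
    for label in path:
        targets = [t for l, t in graph.get(current, []) if l == label]
        if not targets:
            return None
        current = targets[0]
    return current

def sampled_search(graph, nodes, n_seeds, seed=0):
    """Search n_seeds random nodes fully, then only replay the hits elsewhere."""
    rng = random.Random(seed)
    known = set()
    for node in rng.sample(nodes, n_seeds):
        known.update(full_search(graph, node))
    values = {n: {p: follow(graph, n, p) for p in known} for n in nodes}
    return known, values

nodes = [f"senator{i}" for i in range(5)]
known, values = sampled_search(GRAPH, nodes, n_seeds=2)
print(known)
print(values["senator3"])
```

The cost of the full search is paid only for the seeds; every other node costs one lookup per known path, which is the saving the slide describes.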
29
Sampling results
Average # edges examined at all depths:
             Seed nodes (10)  Others (89)
  Image      920              82
  Name       40               35
  State      200              175
  Latitude   3100             144
  Longitude  3100             144
  TOTAL      7360             580
Total edges examined:
  without sampling: 7360 × 99 = 728640
  with sampling: 7360 × 10 + 580 × 89 = 125220
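The totals on this slide can be re-derived from the per-field averages. The short check below reads the table as two columns, seed nodes and others, which is consistent with the stated totals of 7360 and 580. The variable names are illustrative.

```python
# Average edges examined per node, by field, as read from the slide.
seed_edges = {"Image": 920, "Name": 40, "State": 200,
              "Latitude": 3100, "Longitude": 3100}
other_edges = {"Image": 82, "Name": 35, "State": 175,
               "Latitude": 144, "Longitude": 144}

per_seed = sum(seed_edges.values())    # 7360
per_other = sum(other_edges.values())  # 580

n_seeds, n_others = 10, 89
without_sampling = per_seed * (n_seeds + n_others)
with_sampling = per_seed * n_seeds + per_other * n_others

print(without_sampling, with_sampling)
print(f"reduction: {without_sampling / with_sampling:.1f}x")
```

Under this reading, sampling cuts the number of edges examined by a factor of about 5.8.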
30
Performance
Runtime for senators’ example (sec):
  State latitude  State longitude  Image  Name   Instances  total
  0.911           0.854            0.542  0.513  0.187      3.007
Runtime for astronauts’ example (sec):
  Mission launch  Mission insignia  Name   Instances  total
  1.109           1.151             0.743  1.102      4.105
Runtime for each field in countries’ example (sec):
  GDP per capita  Inflation  Flag   Name   Instances  total
  1.142           2.228      0.867  1.108  1.136      6.481
Performance now interactive.
With new pruning techniques, ~100x faster than reported in paper.
31
Variations – senators’ flags versus birth places
32
Timeline of manned spaceflight
33
Scatterplot of inflation vs. GDP
34
Precision / Recall
Senators – state latitude:
  Columns: Correct, Incorrect
  Accepted: 6434
  Rejected: 10
Countries – gdp per capita:
  Columns: Correct, Incorrect
  Accepted: 20658
  Rejected: 90
Senators – image:
  Columns: Correct, Incorrect
  Accepted: 866
  Rejected: 06
35
Summary
Visualize heterogeneous data represented as a graph of relationships between objects
Produce visualizations conforming to templates by searching for needed attributes
Present algorithm for finding and ranking paths based on keywords
  Efficiently enumerate paths
  Rank
Now fast enough for interactive use
High precision and recall
36
Future work
Improvements
  UI support for initial discovery and query refinement
  Robustness of terms / Improved ranking
  Automatic selection of visualization
  Visualizing missing data
  Visualizations that reflect result relevance (selective emphasis)
Deploy on the web
  Wikipedia
  The whole web
37
Acknowledgements
Funding sources:
  Boeing
  RVAC
  CALO
Tools and data:
  DBpedia
  MIT SIMILE project timeline
  Tom Patterson’s map artwork
38
The end!
39
Pruning techniques / Bidirectional Search
Before A*, search one step back from each literal, following only edges that match keywords.
This saves one step during forward A* search.
[Figure: example graph around the node Dianne Feinstein, with one path annotated "No mention of latitude"]
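The backward step can be pictured as building an index of nodes that sit one keyword-matching edge away from a numeric literal. A forward search could then stop as soon as it reaches an indexed node. The graph, the function name, and the keyword-matching rule (splitting labels on underscores) are assumptions for illustration.

```python
# Hypothetical toy graph: node -> list of (edge label, target).
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node")],
    "state_node": [("capital", "capital_node"), ("pop", 6349000)],
    "capital_node": [("latitude", 42.4)],
    "spouse_node": [("birth_place", "place_node")],
    "place_node": [("latitude", 39.0)],
}

def keyword_literal_parents(graph, keywords):
    """Index nodes that have a keyword-matching edge straight to a number."""
    parents = {}
    for node, edges in graph.items():
        for label, target in edges:
            is_literal = isinstance(target, (int, float))
            if is_literal and set(label.split("_")) & keywords:
                parents.setdefault(node, []).append((label, target))
    return parents

index = keyword_literal_parents(GRAPH, {"state", "latitude"})
print(index)
```

Here the "pop" edge is skipped because its label matches no keyword, so state_node is not indexed even though it has a numeric literal.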
40
Need for multiple paths